2025-05-07T20:22:35.2900281Z Current runner version: '2.323.0'
2025-05-07T20:22:35.2907631Z Runner name: 'i-09c05d8e2aea2c844'
2025-05-07T20:22:35.2908654Z Machine name: 'ip-10-0-65-139'
2025-05-07T20:22:35.2911533Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:35.2913935Z Contents: read
2025-05-07T20:22:35.2914458Z Metadata: read
2025-05-07T20:22:35.2914974Z Packages: read
2025-05-07T20:22:35.2915478Z ##[endgroup]
2025-05-07T20:22:35.2917704Z Secret source: None
2025-05-07T20:22:35.2918799Z Prepare workflow directory
2025-05-07T20:22:35.3443409Z Prepare all required actions
2025-05-07T20:22:35.3480153Z Getting action download info
2025-05-07T20:22:35.5695775Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.7981248Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:36.0540288Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.6063262Z Getting action download info
2025-05-07T20:22:37.6974440Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:37.9570064Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.10, 12.8.0, 12.6.3, clang)
2025-05-07T20:22:38.0198720Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:38.0335890Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:38.0349018Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:38.0350615Z ##[endgroup]
2025-05-07T20:22:39.2774758Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.2775200Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.2775464Z AMI Name: unknown
2025-05-07T20:22:39.2816319Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.7176826Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.7177157Z with:
2025-05-07T20:22:44.7177423Z submodules: true
2025-05-07T20:22:44.7177672Z repository: pytorch/FBGEMM
2025-05-07T20:22:44.7178078Z token: ***
2025-05-07T20:22:44.7178296Z ssh-strict: true
2025-05-07T20:22:44.7178522Z ssh-user: git
2025-05-07T20:22:44.7178753Z persist-credentials: true
2025-05-07T20:22:44.7179020Z clean: true
2025-05-07T20:22:44.7179259Z sparse-checkout-cone-mode: true
2025-05-07T20:22:44.7179539Z fetch-depth: 1
2025-05-07T20:22:44.7179765Z fetch-tags: false
2025-05-07T20:22:44.7179994Z show-progress: true
2025-05-07T20:22:44.7180226Z lfs: false
2025-05-07T20:22:44.7180444Z set-safe-directory: true
2025-05-07T20:22:44.7180704Z env:
2025-05-07T20:22:44.7180923Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.7181244Z BUILD_ENV: build_binary
2025-05-07T20:22:44.7181511Z BUILD_TARGET: genai
2025-05-07T20:22:44.7181750Z BUILD_VARIANT: cuda
2025-05-07T20:22:44.7182025Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:44.7182290Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.7182545Z ##[endgroup]
2025-05-07T20:22:44.8359806Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.8361099Z ##[group]Getting Git version info
2025-05-07T20:22:44.8361563Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.8362198Z [command]/usr/bin/git version
2025-05-07T20:22:44.8362478Z git version 2.47.1
2025-05-07T20:22:44.8366966Z ##[endgroup]
2025-05-07T20:22:44.8389458Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/c98a28e5-12a9-4c25-a94b-b5f916230478' before making global git config changes
2025-05-07T20:22:44.8390377Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.8394941Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.8432026Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.8435336Z ##[group]Initializing the repository
2025-05-07T20:22:44.8439450Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.8480383Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.8481002Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.8481558Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.8481962Z hint:
2025-05-07T20:22:44.8482275Z hint: git config --global init.defaultBranch <name>
2025-05-07T20:22:44.8482626Z hint:
2025-05-07T20:22:44.8482965Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.8483527Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.8483967Z hint:
2025-05-07T20:22:44.8484195Z hint: git branch -m <name>
2025-05-07T20:22:44.8484717Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.8492220Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.8528044Z ##[endgroup]
2025-05-07T20:22:44.8528509Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.8532196Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.8565193Z ##[endgroup]
2025-05-07T20:22:44.8565617Z ##[group]Setting up auth
2025-05-07T20:22:44.8571332Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.8602815Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.8978599Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.9010963Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.9359837Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.9410319Z ##[endgroup]
2025-05-07T20:22:44.9410786Z ##[group]Fetching the repository
2025-05-07T20:22:44.9418295Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3961534Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3962068Z * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3998488Z ##[endgroup]
2025-05-07T20:22:45.3998901Z ##[group]Determining the checkout info
2025-05-07T20:22:45.4001628Z ##[endgroup]
2025-05-07T20:22:45.4007292Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.4046064Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:22:45.4087792Z ##[group]Checking out the ref
2025-05-07T20:22:45.4091638Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.5172855Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.5173200Z
2025-05-07T20:22:45.5173494Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.5174031Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.5174564Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.5174884Z
2025-05-07T20:22:45.5175107Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.5175593Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.5175871Z
2025-05-07T20:22:45.5175992Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.5176197Z
2025-05-07T20:22:45.5176361Z Or undo this operation with:
2025-05-07T20:22:45.5176546Z
2025-05-07T20:22:45.5176648Z   git switch -
2025-05-07T20:22:45.5177045Z
2025-05-07T20:22:45.5177284Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.5177640Z
2025-05-07T20:22:45.5178033Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.5186540Z ##[endgroup]
2025-05-07T20:22:45.5186979Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.5192253Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.5241628Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.5273762Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.5307062Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.5335356Z ##[endgroup]
2025-05-07T20:22:45.5335767Z ##[group]Fetching submodules
2025-05-07T20:22:45.5338860Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.5686511Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.6021218Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.6024775Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.6028149Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.6032710Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.6036639Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.6040872Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.6044339Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.6077121Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.9180885Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.4841145Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:47.0385484Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:48.1051024Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.4545988Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.7984282Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.9716653Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.9717242Z * branch e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:50.0198784Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:50.7287006Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:50.7287489Z * branch 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:51.0101912Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:51.6752366Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:51.6753972Z * branch 6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:51.7847925Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:52.9290707Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:52.9291339Z * branch 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:53.6293158Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.4226596Z From https://github.com/google/googletest
2025-05-07T20:22:54.4227078Z * branch f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.4637058Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:55.1519707Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:55.1520228Z * branch 420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:55.1606190Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:55.9522517Z From https://github.com/nlohmann/json
2025-05-07T20:22:55.9523452Z * branch 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:56.0673803Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:56.0692594Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:56.1035562Z Entering 'external/asmjit'
2025-05-07T20:22:56.1068790Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.1100417Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.1132586Z Entering 'external/cutlass'
2025-05-07T20:22:56.1164078Z Entering 'external/googletest'
2025-05-07T20:22:56.1196231Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.1228284Z Entering 'external/json'
2025-05-07T20:22:56.1275206Z ##[endgroup]
2025-05-07T20:22:56.1275646Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:56.1282187Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:56.1620297Z Entering 'external/asmjit'
2025-05-07T20:22:56.1687724Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.1760053Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.1826220Z Entering 'external/cutlass'
2025-05-07T20:22:56.1900560Z Entering 'external/googletest'
2025-05-07T20:22:56.1967759Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.2039895Z Entering 'external/json'
2025-05-07T20:22:56.2124532Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:56.2463720Z Entering 'external/asmjit'
2025-05-07T20:22:56.2528199Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:56.2531397Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.2594384Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:56.2597060Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.2658900Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:56.2661699Z Entering 'external/cutlass'
2025-05-07T20:22:56.2723037Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:56.2726547Z Entering 'external/googletest'
2025-05-07T20:22:56.2790404Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:56.2793792Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.2856634Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:56.2859099Z Entering 'external/json'
2025-05-07T20:22:56.2920206Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:56.3007596Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:56.3348414Z Entering 'external/asmjit'
2025-05-07T20:22:56.3381318Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.3415408Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.3447719Z Entering 'external/cutlass'
2025-05-07T20:22:56.3480728Z Entering 'external/googletest'
2025-05-07T20:22:56.3513448Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.3546437Z Entering 'external/json'
2025-05-07T20:22:56.3595168Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.3934290Z Entering 'external/asmjit'
2025-05-07T20:22:56.3967884Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.4000424Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.4033163Z Entering 'external/cutlass'
2025-05-07T20:22:56.4065665Z Entering 'external/googletest'
2025-05-07T20:22:56.4098664Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.4132022Z Entering 'external/json'
2025-05-07T20:22:56.4201411Z ##[endgroup]
2025-05-07T20:22:56.4221739Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.4251278Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:22:56.4436666Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.4437002Z with:
2025-05-07T20:22:56.4437262Z name: fbgemm_genai_x86_clang_py3.10_cu12.8.0.whl
2025-05-07T20:22:56.4437614Z merge-multiple: false
2025-05-07T20:22:56.4437885Z repository: pytorch/FBGEMM
2025-05-07T20:22:56.4438174Z run-id: 14891846252
2025-05-07T20:22:56.4438399Z env:
2025-05-07T20:22:56.4438645Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.4438972Z BUILD_ENV: build_binary
2025-05-07T20:22:56.4439229Z BUILD_TARGET: genai
2025-05-07T20:22:56.4439470Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.4439724Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:56.4439992Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.4440250Z ##[endgroup]
2025-05-07T20:22:56.6808931Z Downloading single artifact
2025-05-07T20:22:56.7712018Z Preparing to download the following artifacts:
2025-05-07T20:22:56.7713067Z - fbgemm_genai_x86_clang_py3.10_cu12.8.0.whl (ID: 3081404175, Size: 18501011, Expected Digest: sha256:11df06046b7d4c3f3f186959566dfdd554d7e11b3fd21f4c28aab1ad73234076)
2025-05-07T20:22:56.8196918Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-73f140a1-be13-5b4d-b1a3-0329f4aec114/artifacts/7fea0a3d48a0a904e7ca275a23bd63820365acfdb69b50cc760cc4ba3d0dc013.zip
2025-05-07T20:22:56.8198485Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:56.8843863Z (node:57009) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:56.8844956Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:57.1433009Z SHA256 digest of downloaded artifact is 11df06046b7d4c3f3f186959566dfdd554d7e11b3fd21f4c28aab1ad73234076
2025-05-07T20:22:57.1433762Z Artifact download completed successfully.
2025-05-07T20:22:57.1434114Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:57.1440009Z Download artifact has finished successfully
2025-05-07T20:22:57.1709254Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:57.1709672Z with:
2025-05-07T20:22:57.1709904Z driver-version: 570.133.07
2025-05-07T20:22:57.1710175Z env:
2025-05-07T20:22:57.1710413Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.1710730Z BUILD_ENV: build_binary
2025-05-07T20:22:57.1710991Z BUILD_TARGET: genai
2025-05-07T20:22:57.1711242Z BUILD_VARIANT: cuda
2025-05-07T20:22:57.1711490Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:57.1711763Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.1712020Z ##[endgroup]
2025-05-07T20:22:57.1803035Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:57.1803436Z with:
2025-05-07T20:22:57.1803893Z timeout_minutes: 10
2025-05-07T20:22:57.1804149Z max_attempts: 3
2025-05-07T20:22:57.1828510Z command: # Is it disgusting to have a full shell script here in this github action? Sure
# But is it the best way to make it so that this action relies on nothing else? Absolutely
set -eou pipefail

DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

install_nvidia_docker2_amzn2() {
  (
    set -x
    # Needed for yum-config-manager
    sudo yum install -y yum-utils
    if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
      YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
    else
      # Amazon Linux 2
      YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
    fi
    sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
    sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
    sudo systemctl restart docker
  )
}

install_nvidia_docker2_ubuntu20() {
  (
    set -x
    # Install nvidia-driver package if not installed
    status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
    if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
      sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
      sudo systemctl restart docker
    fi
  )
}

pre_install_nvidia_driver_amzn2() {
  (
    # Purge any nvidia driver installed from RHEL repo
    sudo yum remove -y nvidia-driver-latest-dkms
  )
}

install_nvidia_driver_common() {
  (
    # Try to gather more information about the runner and its existing NVIDIA driver if any
    echo "Before installing NVIDIA driver"
    lspci
    lsmod
    modinfo nvidia || true

    HAS_NVIDIA_DRIVER=0
    # Check if NVIDIA driver has already been installed
    if [ -x "$(command -v nvidia-smi)" ]; then
      set +e
      # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
      # so that the same driver version is not print over multiple lines
      INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
      NVIDIA_SMI_STATUS=$?
      if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
        echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
      elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
        echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
        # Turn off persistent mode so that the installation script can unload the kernel module
        sudo killall nvidia-persistenced || true
      else
        HAS_NVIDIA_DRIVER=1
        echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
      fi
      set -e
    fi

    if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
      # CAUTION: this may need to be updated in future
      if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
        sudo yum groupinstall -y "Development Tools"
        # ensure our kernel install is the same as our underlying kernel,
        # groupinstall "Development Tools" has a habit of mismatching kernel headers
        sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
        sudo modprobe backlight
      fi
      sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

      set +e
      sudo /bin/bash /tmp/nvidia_driver -s --no-drm
      NVIDIA_INSTALLATION_STATUS=$?

      RESET_GPU=0
      if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
        sudo cat /var/log/nvidia-installer.log
        # Fail to install NVIDIA driver, try to reset the GPU
        RESET_GPU=1
      elif [ -x "$(command -v nvidia-smi)" ]; then
        # Check again if nvidia-smi works even if the driver installation completes successfully
        INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
        NVIDIA_SMI_STATUS=$?
        if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
          RESET_GPU=1
        fi
      fi

      if [ "$RESET_GPU" -eq 1 ]; then
        NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
        # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. When this
        # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
        for PCI_ID in $NVIDIA_DEVICES; do
          DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
          echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)"
          # This requires sudo permission of course
          echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
          sleep 1
        done
      fi

      sudo rm -fv /tmp/nvidia_driver
      set -e
    fi
  )
}

post_install_nvidia_driver_common() {
  (
    sudo modprobe nvidia || true
    echo "After installing NVIDIA driver"
    lspci
    lsmod
    modinfo nvidia || true
    (
      set +e
      nvidia-smi
      # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
      # the case where the driver has already crashed as it still can get the driver version
      # and some basic information like the bus ID. However, the rest of the information
      # would be missing (ERR!), for example:
      #
      # +-----------------------------------------------------------------------------+
      # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 |
      # |-------------------------------+----------------------+----------------------+
      # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
      # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
      # | | | MIG M. |
      # |===============================+======================+======================|
      # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! |
      # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default |
      # | | | ERR! |
      # +-------------------------------+----------------------+----------------------+
      #
      # +-----------------------------------------------------------------------------+
      # | Processes: |
      # | GPU GI CI PID Type Process name GPU Memory |
      # | ID ID Usage |
      # |=============================================================================|
      # +-----------------------------------------------------------------------------+
      #
      # This should be reported as a failure instead as it will guarantee to fail when
      # Docker tries to run with --gpus all
      #
      # So, the correct check here is to query one of the missing piece of info like
      # GPU name, so that the command can fail accordingly
      nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
      NVIDIA_SMI_STATUS=$?
      # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
      if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
        echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
      else
        echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
        exit ${NVIDIA_SMI_STATUS}
      fi
      set -e
    )
  )
}

install_nvidia_driver_amzn2() {
  (
    set -x
    pre_install_nvidia_driver_amzn2
    install_nvidia_driver_common
    post_install_nvidia_driver_common
  )
}

install_nvidia_driver_ubuntu20() {
  (
    set -x
    install_nvidia_driver_common
    post_install_nvidia_driver_common
  )
}

echo "== Installing nvidia driver ${DRIVER_FN} =="
case "${DISTRIBUTION}" in
  amzn*)
    install_nvidia_driver_amzn2
    ;;
  ubuntu20.04)
    install_nvidia_driver_ubuntu20
    ;;
  *)
    echo "ERROR: Unknown distribution ${DISTRIBUTION}"
    exit 1
    ;;
esac

# Install container toolkit based on distribution
echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
case "${DISTRIBUTION}" in
  amzn*)
    install_nvidia_docker2_amzn2
    ;;
  ubuntu20.04)
    install_nvidia_docker2_ubuntu20
    ;;
  *)
    echo "ERROR: Unknown distribution ${DISTRIBUTION}"
    exit 1
    ;;
esac

echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

# Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
# more than one GPUs. This just needs to be run once. The command fails
# on subsequent runs and complains that the mode is already on, but that's
# ok
sudo nvidia-persistenced || true
# This should show persistence mode ON
nvidia-smi
2025-05-07T20:22:57.1852624Z retry_wait_seconds: 10
2025-05-07T20:22:57.1852904Z polling_interval_seconds: 1
2025-05-07T20:22:57.1853183Z warning_on_retry: true
2025-05-07T20:22:57.1853448Z continue_on_error: false
2025-05-07T20:22:57.1853704Z env:
2025-05-07T20:22:57.1853943Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.1854262Z BUILD_ENV: build_binary
2025-05-07T20:22:57.1854528Z BUILD_TARGET: genai
2025-05-07T20:22:57.1854771Z BUILD_VARIANT: cuda
2025-05-07T20:22:57.1855032Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:57.1855307Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.1855565Z DRIVER_VERSION: 570.133.07
2025-05-07T20:22:57.1855828Z ##[endgroup]
2025-05-07T20:22:57.2653179Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:57.2653809Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:57.2658099Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.8167653Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.8181118Z No packages marked for removal.
2025-05-07T20:22:57.8231615Z Dependencies resolved.
2025-05-07T20:22:57.8241042Z Nothing to do.
2025-05-07T20:22:57.8241423Z Complete!
2025-05-07T20:22:57.8547901Z + install_nvidia_driver_common
2025-05-07T20:22:57.8552770Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.8553083Z + lspci
2025-05-07T20:22:57.8555013Z Before installing NVIDIA driver
2025-05-07T20:22:57.8740811Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.8742047Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.8742998Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.8743882Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.8744663Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.8745574Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.8746363Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.8747170Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.8747845Z + lsmod
2025-05-07T20:22:57.8783848Z Module Size Used by
2025-05-07T20:22:57.8784205Z xt_conntrack 16384 1
2025-05-07T20:22:57.8784597Z nft_chain_nat 16384 3
2025-05-07T20:22:57.8785035Z xt_MASQUERADE 20480 1
2025-05-07T20:22:57.8785547Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.8786058Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:57.8786485Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.8786947Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:57.8787485Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:57.8787787Z xfrm_user 57344 1
2025-05-07T20:22:57.8788077Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:57.8788384Z xt_addrtype 16384 2
2025-05-07T20:22:57.8788652Z nft_compat 20480 4
2025-05-07T20:22:57.8788975Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.8789418Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.8789815Z br_netfilter 36864 0
2025-05-07T20:22:57.8790111Z bridge 323584 1 br_netfilter
2025-05-07T20:22:57.8790434Z stp 16384 1 bridge
2025-05-07T20:22:57.8790731Z llc 16384 2 bridge,stp
2025-05-07T20:22:57.8791024Z overlay 167936 0
2025-05-07T20:22:57.8791287Z tls 135168 0
2025-05-07T20:22:57.8791552Z nls_ascii 16384 1
2025-05-07T20:22:57.8791812Z nls_cp437 20480 1
2025-05-07T20:22:57.8792078Z vfat 24576 1
2025-05-07T20:22:57.8792356Z fat 86016 1 vfat
2025-05-07T20:22:57.8792632Z sunrpc 696320 1
2025-05-07T20:22:57.8792892Z ena 180224 0
2025-05-07T20:22:57.8793149Z i8042 45056 0
2025-05-07T20:22:57.8793414Z serio 28672 3 i8042
2025-05-07T20:22:57.8793808Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:57.8794095Z button 24576 0
2025-05-07T20:22:57.8794362Z sch_fq_codel 20480 17
2025-05-07T20:22:57.8794628Z dm_mod 188416 0
2025-05-07T20:22:57.8794893Z fuse 163840 1
2025-05-07T20:22:57.8795160Z loop 36864 0
2025-05-07T20:22:57.8795423Z configfs 57344 1
2025-05-07T20:22:57.8795694Z dax 45056 1 dm_mod
2025-05-07T20:22:57.8795984Z dmi_sysfs 20480 0
2025-05-07T20:22:57.8796246Z crc32_pclmul 16384 0
2025-05-07T20:22:57.8796512Z crc32c_intel 24576 0
2025-05-07T20:22:57.8796781Z efivarfs 24576 1
2025-05-07T20:22:57.8797035Z + modinfo nvidia
2025-05-07T20:22:57.8802586Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.8803090Z import_ns: DMA_BUF
2025-05-07T20:22:57.8803355Z alias: char-major-195-*
2025-05-07T20:22:57.8803635Z version: 570.133.07
2025-05-07T20:22:57.8803904Z supported: external
2025-05-07T20:22:57.8804255Z license: Dual MIT/GPL
2025-05-07T20:22:57.8804593Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.8804945Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.8805630Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:57.8806068Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.8806427Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.8806768Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.8807093Z depends: i2c-core,drm
2025-05-07T20:22:57.8807361Z retpoline: Y
2025-05-07T20:22:57.8807585Z name: nvidia
2025-05-07T20:22:57.8807961Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.8808452Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.8808906Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.8809446Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.8809772Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:57.8810095Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.8810435Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:57.8810754Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:57.8811071Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:57.8811442Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.8811843Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.8812196Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.8812503Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:57.8812822Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.8813197Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.8813602Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.8813994Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.8814433Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.8814856Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.8815289Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.8815720Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.8816079Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.8816461Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.8816848Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.8817207Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.8817540Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.8817889Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.8818229Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.8818555Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:57.8818914Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.8819297Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.8819642Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:57.8819991Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.8820356Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.8820710Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:57.8821061Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.8821411Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:57.8821718Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.8822078Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.8822414Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.8822745Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.8823092Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.8823462Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.8824100Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:57.8824454Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.8824809Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.8825163Z parm: rm_firmware_active:charp
2025-05-07T20:22:57.8825618Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.8825876Z ++ command -v nvidia-smi
2025-05-07T20:22:57.8826145Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.8826412Z + set +e
2025-05-07T20:22:57.8826737Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:59.6862805Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:59.6863497Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:59.6863996Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:59.6864446Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:59.6865007Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:59.6865905Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:59.6866865Z + set -e
2025-05-07T20:22:59.6867802Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:59.6868616Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:59.6869584Z + post_install_nvidia_driver_common
2025-05-07T20:22:59.6870107Z + sudo modprobe nvidia
2025-05-07T20:22:59.8125095Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:59.8125420Z + lspci
2025-05-07T20:22:59.8125777Z After installing NVIDIA driver
2025-05-07T20:22:59.8244449Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:59.8244978Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:59.8245556Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:59.8246093Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:59.8246610Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:59.8247160Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:59.8247683Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:59.8248175Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:59.8248604Z + lsmod
2025-05-07T20:22:59.8278347Z Module Size Used by
2025-05-07T20:22:59.8278682Z nvidia_uvm 1884160 0
2025-05-07T20:22:59.8278965Z nvidia 11583488 1 nvidia_uvm
2025-05-07T20:22:59.8279275Z drm 602112 1 nvidia
2025-05-07T20:22:59.8279599Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:22:59.8279939Z backlight 24576 1 drm
2025-05-07T20:22:59.8280257Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:22:59.8280567Z xt_conntrack 16384 1
2025-05-07T20:22:59.8280834Z nft_chain_nat 16384 3
2025-05-07T20:22:59.8281108Z xt_MASQUERADE 20480 1
2025-05-07T20:22:59.8281422Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:59.8281770Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:59.8282185Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:59.8282636Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:59.8282968Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:59.8283274Z xfrm_user 57344 1
2025-05-07T20:22:59.8283558Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:59.8283854Z xt_addrtype 16384 2
2025-05-07T20:22:59.8284126Z nft_compat 20480 4
2025-05-07T20:22:59.8284453Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:59.8284883Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:59.8285269Z br_netfilter 36864 0
2025-05-07T20:22:59.8285562Z bridge 323584 1 br_netfilter
2025-05-07T20:22:59.8285873Z stp 16384 1 bridge
2025-05-07T20:22:59.8286168Z llc 16384 2 bridge,stp
2025-05-07T20:22:59.8286474Z overlay 167936 0
2025-05-07T20:22:59.8286740Z tls 135168 0
2025-05-07T20:22:59.8286997Z nls_ascii 16384 1
2025-05-07T20:22:59.8288997Z nls_cp437 20480 1
2025-05-07T20:22:59.8289276Z vfat 24576 1
2025-05-07T20:22:59.8289535Z fat 86016 1 vfat
2025-05-07T20:22:59.8289821Z sunrpc 696320 1
2025-05-07T20:22:59.8290091Z ena 180224 0
2025-05-07T20:22:59.8290351Z i8042 45056 0
2025-05-07T20:22:59.8290614Z serio 28672 3 i8042
2025-05-07T20:22:59.8290906Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:59.8291185Z button 24576 0
2025-05-07T20:22:59.8291448Z sch_fq_codel 20480 17
2025-05-07T20:22:59.8291718Z dm_mod 188416 0
2025-05-07T20:22:59.8291979Z fuse 163840 1
2025-05-07T20:22:59.8292236Z loop 36864 0
2025-05-07T20:22:59.8292653Z configfs 57344 1
2025-05-07T20:22:59.8292923Z dax 45056 1 dm_mod
2025-05-07T20:22:59.8293202Z dmi_sysfs 20480 0
2025-05-07T20:22:59.8293465Z crc32_pclmul 16384 0
2025-05-07T20:22:59.8293739Z crc32c_intel 24576 0
2025-05-07T20:22:59.8293998Z efivarfs 24576 1
2025-05-07T20:22:59.8294257Z + modinfo nvidia
2025-05-07T20:22:59.8295088Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:59.8295785Z import_ns: DMA_BUF
2025-05-07T20:22:59.8296148Z alias: char-major-195-*
2025-05-07T20:22:59.8296443Z version: 570.133.07
2025-05-07T20:22:59.8296703Z supported: external
2025-05-07T20:22:59.8296963Z license: Dual MIT/GPL
2025-05-07T20:22:59.8297270Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:59.8297629Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:59.8297963Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:59.8298307Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:59.8298662Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:59.8299012Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:59.8299335Z depends: i2c-core,drm
2025-05-07T20:22:59.8299603Z retpoline: Y
2025-05-07T20:22:59.8299834Z name: nvidia
2025-05-07T20:22:59.8300207Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:59.8300702Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:59.8301163Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:59.8301599Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:59.8301917Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:59.8302236Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:59.8302568Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:59.8302878Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:59.8303202Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:59.8303587Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:59.8303990Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:59.8304342Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:59.8304661Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:59.8304975Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:59.8305358Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:59.8305776Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:59.8306172Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:59.8306599Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.8307025Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:59.8307465Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.8307892Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:59.8308244Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:59.8308629Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:59.8309162Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:59.8309518Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:59.8309854Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:59.8310200Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:59.8310531Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:59.8310855Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:59.8311217Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:59.8311593Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:59.8311938Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:59.8312296Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:59.8312650Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:59.8313092Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:59.8313450Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:59.8313934Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:59.8314241Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:59.8314591Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:59.8314931Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:59.8315254Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:59.8315599Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:59.8315974Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:59.8316334Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:59.8316679Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:59.8317043Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:59.8317403Z parm: rm_firmware_active:charp
2025-05-07T20:22:59.8317697Z + set +e
2025-05-07T20:22:59.8317907Z + nvidia-smi
2025-05-07T20:23:01.2265526Z Wed May 7 20:23:01 2025
2025-05-07T20:23:01.2265950Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.2266517Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:01.2267021Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.2267527Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:01.2268073Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:01.2268522Z | | | MIG M. |
2025-05-07T20:23:01.2268872Z |=========================================+========================+======================|
2025-05-07T20:23:01.2329704Z | 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:01.2330194Z | 0% 29C P0 62W / 300W | 0MiB / 23028MiB | 4% Default |
2025-05-07T20:23:01.2330603Z | | | N/A |
2025-05-07T20:23:01.2331013Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.2331416Z
2025-05-07T20:23:01.2331829Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.2332277Z | Processes: |
2025-05-07T20:23:01.2332734Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:01.2333158Z | ID ID Usage |
2025-05-07T20:23:01.2333525Z |=========================================================================================|
2025-05-07T20:23:01.2334547Z | No running processes found |
2025-05-07T20:23:01.2335417Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.6457087Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:03.0325966Z NVIDIA A10G
2025-05-07T20:23:03.2951463Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:03.2951743Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:03.2951998Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:03.2952304Z + set -e
2025-05-07T20:23:03.2952530Z INFO: Ignoring allowed status 0
2025-05-07T20:23:03.2961133Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:03.2964246Z + sudo yum install -y yum-utils
2025-05-07T20:23:03.7216289Z Last metadata expiration check: 0:05:54 ago on Wed May 7 20:17:09 2025.
2025-05-07T20:23:03.7467439Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:03.7868794Z Dependencies resolved.
2025-05-07T20:23:03.8051548Z Nothing to do.
2025-05-07T20:23:03.8051799Z Complete!
2025-05-07T20:23:03.8437623Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:03.8438263Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.8439143Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.1340352Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.1913875Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:04.7483623Z nvidia-container-toolkit 14 kB/s | 833 B 00:00
2025-05-07T20:23:04.7736681Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:04.8140991Z Dependencies resolved.
2025-05-07T20:23:04.8320328Z ================================================================================
2025-05-07T20:23:04.8320948Z Package Arch Version Repository Size
2025-05-07T20:23:04.8321507Z ================================================================================
2025-05-07T20:23:04.8321858Z Downgrading:
2025-05-07T20:23:04.8322231Z nvidia-container-toolkit x86_64 1.16.2-1 nvidia-container-toolkit 1.2 M
2025-05-07T20:23:04.8322837Z nvidia-container-toolkit-base x86_64 1.16.2-1 nvidia-container-toolkit 5.6 M
2025-05-07T20:23:04.8323279Z
2025-05-07T20:23:04.8323426Z Transaction Summary
2025-05-07T20:23:04.8323975Z ================================================================================
2025-05-07T20:23:04.8324436Z Downgrade 2 Packages
2025-05-07T20:23:04.8324617Z
2025-05-07T20:23:04.8324761Z Total download size: 6.8 M
2025-05-07T20:23:04.8325145Z Downloading Packages:
2025-05-07T20:23:04.8741766Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64 31 MB/s | 1.2 MB 00:00
2025-05-07T20:23:04.9175838Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x 67 MB/s | 5.6 MB 00:00
2025-05-07T20:23:04.9184854Z --------------------------------------------------------------------------------
2025-05-07T20:23:04.9188120Z Total 80 MB/s | 6.8 MB 00:00
2025-05-07T20:23:04.9190502Z Running transaction check
2025-05-07T20:23:04.9297204Z Transaction check succeeded.
2025-05-07T20:23:04.9297640Z Running transaction test
2025-05-07T20:23:04.9593273Z Transaction test succeeded.
2025-05-07T20:23:04.9595394Z Running transaction
2025-05-07T20:23:05.5130644Z Preparing : 1/1
2025-05-07T20:23:05.6195516Z Downgrading : nvidia-container-toolkit-base-1.16.2-1.x86_64 1/4
2025-05-07T20:23:05.6217229Z Downgrading : nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:05.6427866Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:05.6428489Z Cleanup : nvidia-container-toolkit-1.17.6-1.x86_64 3/4
2025-05-07T20:23:05.6538036Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 3/4
2025-05-07T20:23:05.6560105Z Cleanup : nvidia-container-toolkit-base-1.17.6-1.x86_64 4/4
2025-05-07T20:23:07.1212316Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 4/4
2025-05-07T20:23:07.1212947Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 1/4
2025-05-07T20:23:07.1213510Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:07.1214066Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 3/4
2025-05-07T20:23:07.2584752Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 4/4================================================================================
2025-05-07T20:23:07.2585693Z WARNING:
2025-05-07T20:23:07.2585956Z A newer release of "Amazon Linux" is available.
2025-05-07T20:23:07.2586197Z
2025-05-07T20:23:07.2586301Z Available Versions:
2025-05-07T20:23:07.2586456Z
2025-05-07T20:23:07.2586563Z Version 2023.7.20250331:
2025-05-07T20:23:07.2586895Z Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:07.2587163Z
2025-05-07T20:23:07.2587292Z dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:07.2587512Z
2025-05-07T20:23:07.2587609Z Release notes:
2025-05-07T20:23:07.2588053Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:07.2588440Z
2025-05-07T20:23:07.2588536Z Version 2023.7.20250414:
2025-05-07T20:23:07.2588867Z Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:07.2589127Z
2025-05-07T20:23:07.2589254Z dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:07.2589473Z
2025-05-07T20:23:07.2589570Z Release notes:
2025-05-07T20:23:07.2589981Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:07.2590366Z
2025-05-07T20:23:07.2590460Z Version 2023.7.20250428:
2025-05-07T20:23:07.2590792Z Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:07.2591052Z
2025-05-07T20:23:07.2591173Z dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:07.2591402Z
2025-05-07T20:23:07.2591492Z Release notes:
2025-05-07T20:23:07.2600729Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:07.2601159Z
2025-05-07T20:23:07.2601287Z ================================================================================
2025-05-07T20:23:07.2954378Z
2025-05-07T20:23:07.2954538Z
2025-05-07T20:23:07.2954636Z Downgraded:
2025-05-07T20:23:07.2955035Z nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:07.2955623Z nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:07.2955999Z
2025-05-07T20:23:07.2956090Z Complete!
2025-05-07T20:23:07.3395256Z + sudo systemctl restart docker
2025-05-07T20:23:11.3420890Z Wed May 7 20:23:11 2025
2025-05-07T20:23:11.3421344Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.3421874Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:11.3422371Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.3422884Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:11.3423566Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:11.3424239Z | | | MIG M. |
2025-05-07T20:23:11.3424586Z |=========================================+========================+======================|
2025-05-07T20:23:11.3503358Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:11.3504711Z | 0% 29C P0 63W / 300W | 0MiB / 23028MiB | 4% Default |
2025-05-07T20:23:11.3505126Z | | | N/A |
2025-05-07T20:23:11.3505534Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.3505947Z
2025-05-07T20:23:11.3506508Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.3507111Z | Processes: |
2025-05-07T20:23:11.3507578Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:11.3508237Z | ID ID Usage |
2025-05-07T20:23:11.3508606Z |=========================================================================================|
2025-05-07T20:23:11.3509053Z | No running processes found |
2025-05-07T20:23:11.3509536Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:12.2444947Z Command completed after 1 attempt(s).
2025-05-07T20:23:12.2536834Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:12.2537353Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:12.2552450Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:12.2552829Z env:
2025-05-07T20:23:12.2553084Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:12.2553408Z BUILD_ENV: build_binary
2025-05-07T20:23:12.2553784Z BUILD_TARGET: genai
2025-05-07T20:23:12.2554050Z BUILD_VARIANT: cuda
2025-05-07T20:23:12.2554305Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:12.2554592Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:12.2554928Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.2555282Z ##[endgroup]
2025-05-07T20:23:12.5955250Z ################################################################################
2025-05-07T20:23:12.5955615Z # Print System Info
2025-05-07T20:23:12.5955848Z #
2025-05-07T20:23:12.5970599Z # [2025-05-07T20:23:12.596Z] + print_system_info
2025-05-07T20:23:12.5970960Z ################################################################################
2025-05-07T20:23:12.5971195Z
2025-05-07T20:23:12.5971313Z ################################################################################
2025-05-07T20:23:12.5971662Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.5971965Z + printenv
2025-05-07T20:23:12.5972094Z
2025-05-07T20:23:12.5991178Z SHELL=/bin/bash
2025-05-07T20:23:12.5991577Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.5991987Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.5992518Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_c0e31a38-2415-4935-9bb9-cb592934615a
2025-05-07T20:23:12.5993117Z GITHUB_ACTION=__run
2025-05-07T20:23:12.5993424Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.5993892Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.5994191Z RUNNER_NAME=i-09c05d8e2aea2c844
2025-05-07T20:23:12.5994506Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.5994819Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.5995089Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.5995474Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.5995922Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.5996211Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.5996518Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.5996996Z ***
2025-05-07T20:23:12.5997207Z LOGNAME=ec2-user
2025-05-07T20:23:12.5997445Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.5997720Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.5997969Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.5998197Z SYSTEMD_EXEC_PID=55565
2025-05-07T20:23:12.5998495Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.5999052Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.5999575Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.5999857Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.6000129Z RUNNER_OS=Linux
2025-05-07T20:23:12.6000363Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.6000616Z HOME=/home/ec2-user
2025-05-07T20:23:12.6000879Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.6001181Z LANG=C.UTF-8
2025-05-07T20:23:12.6001495Z RUNNER_TRACKING_ID=github_dcbe1522-ed89-4cf6-b5ae-52b4cb845c2e
2025-05-07T20:23:12.6001861Z RUNNER_ARCH=X64
2025-05-07T20:23:12.6002153Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.6002849Z BUILD_TARGET=genai
2025-05-07T20:23:12.6003385Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_c0e31a38-2415-4935-9bb9-cb592934615a
2025-05-07T20:23:12.6004298Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_c0e31a38-2415-4935-9bb9-cb592934615a
2025-05-07T20:23:12.6005048Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.6005726Z INVOCATION_ID=e7feec3cfd9b4570a4e9feb57496356b
2025-05-07T20:23:12.6006064Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.6006341Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.6006937Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_c0e31a38-2415-4935-9bb9-cb592934615a
2025-05-07T20:23:12.6007556Z BUILD_ENV=build_binary
2025-05-07T20:23:12.6007796Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.6008021Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.6008254Z KERN_NAME_LC=linux
2025-05-07T20:23:12.6008488Z BUILD_CUDA_VERSION=12.8.0
2025-05-07T20:23:12.6008801Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.6009154Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.6009404Z USER=ec2-user
2025-05-07T20:23:12.6009647Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.6009939Z SHLVL=1 2025-05-07T20:23:12.6010135Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:12.6010460Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:12.6010920Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:12.6011285Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:12.6011538Z KERN_NAME=Linux 2025-05-07T20:23:12.6011780Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:12.6012196Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:12.6012636Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:12.6012925Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:12.6013174Z JOURNAL_STREAM=8:86301 2025-05-07T20:23:12.6013514Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:12.6013946Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:12.6014353Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:12.6014774Z GITHUB_BASE_REF=main 2025-05-07T20:23:12.6015059Z CI=true 2025-05-07T20:23:12.6015333Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:12.6015692Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:12.6016054Z GITHUB_ACTION_REF= 2025-05-07T20:23:12.6016340Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:12.6016968Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_c0e31a38-2415-4935-9bb9-cb592934615a 2025-05-07T20:23:12.6017571Z MACHINE_NAME=x86_64 2025-05-07T20:23:12.6017804Z _=/usr/bin/printenv 2025-05-07T20:23:12.6017942Z 2025-05-07T20:23:12.6018070Z ################################################################################ 2025-05-07T20:23:12.6018396Z [INFO] Print ldd version ... 2025-05-07T20:23:12.6018672Z + ldd --version 2025-05-07T20:23:12.6018805Z 2025-05-07T20:23:12.6018912Z ldd (GNU libc) 2.34 2025-05-07T20:23:12.6019187Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:12.6019649Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:12.6020201Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:12.6020667Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:12.6020893Z 2025-05-07T20:23:12.6021016Z ################################################################################ 2025-05-07T20:23:12.6021343Z [INFO] Print CPU info ... 
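The CPU step below runs nproc, lscpu, and cat /proc/cpuinfo in sequence. Note that nproc reports hardware threads, not physical cores: on this g5.4xlarge it prints 16, while lscpu shows 8 cores with 2 threads each. A sketch (not part of the workflow) for deriving the physical-core count directly:

    # Count unique (core, socket) pairs from lscpu's parseable output;
    # this yields 8 on this instance, whereas nproc yields 16 threads.
    lscpu --parse=Core,Socket | grep -v '^#' | sort -u | wc -l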
2025-05-07T20:23:12.6021597Z + nproc 2025-05-07T20:23:12.6021710Z 2025-05-07T20:23:12.6034729Z 16 2025-05-07T20:23:12.6036520Z 2025-05-07T20:23:12.6037178Z + lscpu 2025-05-07T20:23:12.6037343Z 2025-05-07T20:23:12.6147666Z Architecture: x86_64 2025-05-07T20:23:12.6148075Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.6148998Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6149412Z Byte Order: Little Endian 2025-05-07T20:23:12.6149745Z CPU(s): 16 2025-05-07T20:23:12.6150051Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.6150387Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.6150745Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.6151071Z CPU family: 23 2025-05-07T20:23:12.6151572Z Model: 49 2025-05-07T20:23:12.6151882Z Thread(s) per core: 2 2025-05-07T20:23:12.6152190Z Core(s) per socket: 8 2025-05-07T20:23:12.6152483Z Socket(s): 1 2025-05-07T20:23:12.6152775Z Stepping: 0 2025-05-07T20:23:12.6153093Z BogoMIPS: 5600.08 2025-05-07T20:23:12.6155349Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6157599Z Hypervisor vendor: KVM 2025-05-07T20:23:12.6157924Z Virtualization type: full 2025-05-07T20:23:12.6158283Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.6158668Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.6159047Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.6159417Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.6159754Z NUMA node(s): 1 2025-05-07T20:23:12.6160066Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.6160413Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.6160803Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.6161182Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.6161544Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.6161920Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.6162333Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.6162722Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.6163288Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.6163903Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.6164512Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.6165248Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.6166184Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.6166886Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.6167279Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.6167612Z 2025-05-07T20:23:12.6167710Z + cat /proc/cpuinfo 2025-05-07T20:23:12.6167854Z 2025-05-07T20:23:12.6167955Z processor : 0 2025-05-07T20:23:12.6168184Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6168447Z cpu family : 23 2025-05-07T20:23:12.6168673Z model : 49 
2025-05-07T20:23:12.6168890Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6169150Z stepping : 0 2025-05-07T20:23:12.6169376Z microcode : 0x830107f 2025-05-07T20:23:12.6169712Z cpu MHz : 3100.881 2025-05-07T20:23:12.6169941Z cache size : 512 KB 2025-05-07T20:23:12.6170170Z physical id : 0 2025-05-07T20:23:12.6170384Z siblings : 16 2025-05-07T20:23:12.6170597Z core id : 0 2025-05-07T20:23:12.6170809Z cpu cores : 8 2025-05-07T20:23:12.6171015Z apicid : 0 2025-05-07T20:23:12.6171230Z initial apicid : 0 2025-05-07T20:23:12.6171459Z fpu : yes 2025-05-07T20:23:12.6171663Z fpu_exception : yes 2025-05-07T20:23:12.6171891Z cpuid level : 13 2025-05-07T20:23:12.6172112Z wp : yes 2025-05-07T20:23:12.6174249Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6176563Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6177068Z bogomips : 5600.08 2025-05-07T20:23:12.6177304Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6177557Z clflush size : 64 2025-05-07T20:23:12.6177782Z cache_alignment : 64 2025-05-07T20:23:12.6178073Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6178411Z power management: 2025-05-07T20:23:12.6178553Z 2025-05-07T20:23:12.6178640Z processor : 1 2025-05-07T20:23:12.6178871Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6179127Z cpu family : 23 2025-05-07T20:23:12.6179347Z model : 49 2025-05-07T20:23:12.6179563Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6179820Z stepping : 0 2025-05-07T20:23:12.6180049Z microcode : 0x830107f 2025-05-07T20:23:12.6180286Z cpu MHz : 3073.173 2025-05-07T20:23:12.6180515Z cache size : 512 KB 2025-05-07T20:23:12.6180753Z physical id : 0 2025-05-07T20:23:12.6180970Z siblings : 16 2025-05-07T20:23:12.6181187Z core id : 1 2025-05-07T20:23:12.6181401Z cpu cores : 8 2025-05-07T20:23:12.6181610Z apicid : 2 2025-05-07T20:23:12.6181822Z initial apicid : 2 2025-05-07T20:23:12.6182047Z fpu : yes 2025-05-07T20:23:12.6182253Z fpu_exception : yes 2025-05-07T20:23:12.6182483Z cpuid level : 13 2025-05-07T20:23:12.6182705Z wp : yes 2025-05-07T20:23:12.6184722Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6187032Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6187549Z bogomips : 5600.08 2025-05-07T20:23:12.6187784Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6188028Z clflush size : 64 
2025-05-07T20:23:12.6188260Z cache_alignment : 64 2025-05-07T20:23:12.6188547Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6188883Z power management: 2025-05-07T20:23:12.6189020Z 2025-05-07T20:23:12.6189113Z processor : 2 2025-05-07T20:23:12.6189343Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6189597Z cpu family : 23 2025-05-07T20:23:12.6189807Z model : 49 2025-05-07T20:23:12.6190030Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6190286Z stepping : 0 2025-05-07T20:23:12.6190503Z microcode : 0x830107f 2025-05-07T20:23:12.6190741Z cpu MHz : 3265.151 2025-05-07T20:23:12.6190967Z cache size : 512 KB 2025-05-07T20:23:12.6191188Z physical id : 0 2025-05-07T20:23:12.6191409Z siblings : 16 2025-05-07T20:23:12.6191714Z core id : 2 2025-05-07T20:23:12.6191918Z cpu cores : 8 2025-05-07T20:23:12.6192136Z apicid : 4 2025-05-07T20:23:12.6192351Z initial apicid : 4 2025-05-07T20:23:12.6192573Z fpu : yes 2025-05-07T20:23:12.6192788Z fpu_exception : yes 2025-05-07T20:23:12.6193017Z cpuid level : 13 2025-05-07T20:23:12.6193234Z wp : yes 2025-05-07T20:23:12.6195398Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6197698Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6198217Z bogomips : 5600.08 2025-05-07T20:23:12.6198444Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6198697Z clflush size : 64 2025-05-07T20:23:12.6198932Z cache_alignment : 64 2025-05-07T20:23:12.6199224Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6199550Z power management: 2025-05-07T20:23:12.6199696Z 2025-05-07T20:23:12.6199784Z processor : 3 2025-05-07T20:23:12.6200012Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6200264Z cpu family : 23 2025-05-07T20:23:12.6200482Z model : 49 2025-05-07T20:23:12.6200702Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6200949Z stepping : 0 2025-05-07T20:23:12.6201170Z microcode : 0x830107f 2025-05-07T20:23:12.6201410Z cpu MHz : 3094.769 2025-05-07T20:23:12.6201630Z cache size : 512 KB 2025-05-07T20:23:12.6201856Z physical id : 0 2025-05-07T20:23:12.6202079Z siblings : 16 2025-05-07T20:23:12.6202287Z core id : 3 2025-05-07T20:23:12.6202504Z cpu cores : 8 2025-05-07T20:23:12.6202730Z apicid : 6 2025-05-07T20:23:12.6202941Z initial apicid : 6 2025-05-07T20:23:12.6203159Z fpu : yes 2025-05-07T20:23:12.6203372Z fpu_exception : yes 2025-05-07T20:23:12.6203609Z cpuid level : 13 2025-05-07T20:23:12.6203824Z wp : yes 2025-05-07T20:23:12.6205833Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6208127Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6208646Z bogomips : 5600.08 2025-05-07T20:23:12.6208882Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6209126Z clflush size : 64 2025-05-07T20:23:12.6219295Z cache_alignment : 64 2025-05-07T20:23:12.6219638Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6219978Z power management: 2025-05-07T20:23:12.6220132Z 2025-05-07T20:23:12.6220223Z processor : 4 2025-05-07T20:23:12.6220461Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6220715Z cpu family : 23 2025-05-07T20:23:12.6220940Z model : 49 2025-05-07T20:23:12.6221174Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6221437Z stepping : 0 2025-05-07T20:23:12.6221660Z microcode : 0x830107f 2025-05-07T20:23:12.6221915Z cpu MHz : 3259.383 2025-05-07T20:23:12.6222150Z cache size : 512 KB 2025-05-07T20:23:12.6222377Z physical id : 0 2025-05-07T20:23:12.6222604Z siblings : 16 2025-05-07T20:23:12.6222824Z core id : 4 2025-05-07T20:23:12.6223037Z cpu cores : 8 2025-05-07T20:23:12.6223253Z apicid : 8 2025-05-07T20:23:12.6223620Z initial apicid : 8 2025-05-07T20:23:12.6224097Z fpu : yes 2025-05-07T20:23:12.6224405Z fpu_exception : yes 2025-05-07T20:23:12.6224722Z cpuid level : 13 2025-05-07T20:23:12.6224995Z wp : yes 2025-05-07T20:23:12.6227199Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6229493Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6230010Z bogomips : 5600.08 2025-05-07T20:23:12.6230254Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6230513Z clflush size : 64 2025-05-07T20:23:12.6230749Z cache_alignment : 64 2025-05-07T20:23:12.6231035Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6231380Z power management: 2025-05-07T20:23:12.6231532Z 2025-05-07T20:23:12.6231622Z processor : 5 2025-05-07T20:23:12.6231858Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6232112Z cpu family : 23 2025-05-07T20:23:12.6232338Z model : 49 2025-05-07T20:23:12.6232563Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6232826Z stepping : 0 2025-05-07T20:23:12.6233053Z microcode : 0x830107f 2025-05-07T20:23:12.6233297Z cpu MHz : 2906.026 2025-05-07T20:23:12.6233587Z cache size : 512 KB 2025-05-07T20:23:12.6233820Z physical id : 0 2025-05-07T20:23:12.6234043Z siblings : 16 2025-05-07T20:23:12.6234254Z core id : 5 2025-05-07T20:23:12.6234465Z cpu cores : 8 2025-05-07T20:23:12.6234679Z apicid : 10 2025-05-07T20:23:12.6234893Z initial apicid : 10 2025-05-07T20:23:12.6235121Z fpu : yes 2025-05-07T20:23:12.6235345Z fpu_exception : yes 2025-05-07T20:23:12.6235570Z cpuid level : 13 2025-05-07T20:23:12.6235799Z wp : yes 2025-05-07T20:23:12.6237806Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6240096Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6240613Z bogomips : 5600.08 2025-05-07T20:23:12.6240845Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6241097Z clflush size : 64 2025-05-07T20:23:12.6241338Z cache_alignment : 64 2025-05-07T20:23:12.6241623Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6241957Z power management: 2025-05-07T20:23:12.6242097Z 2025-05-07T20:23:12.6242194Z processor : 6 2025-05-07T20:23:12.6242420Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6242678Z cpu family : 23 2025-05-07T20:23:12.6242901Z model : 49 2025-05-07T20:23:12.6243123Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6243384Z stepping : 0 2025-05-07T20:23:12.6243615Z microcode : 0x830107f 2025-05-07T20:23:12.6243858Z cpu MHz : 3205.385 2025-05-07T20:23:12.6244091Z cache size : 512 KB 2025-05-07T20:23:12.6244325Z physical id : 0 2025-05-07T20:23:12.6244546Z siblings : 16 2025-05-07T20:23:12.6244768Z core id : 6 2025-05-07T20:23:12.6244986Z cpu cores : 8 2025-05-07T20:23:12.6245197Z apicid : 12 2025-05-07T20:23:12.6245420Z initial apicid : 12 2025-05-07T20:23:12.6245647Z fpu : yes 2025-05-07T20:23:12.6245853Z fpu_exception : yes 2025-05-07T20:23:12.6246086Z cpuid level : 13 2025-05-07T20:23:12.6246445Z wp : yes 2025-05-07T20:23:12.6248533Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6250813Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6251318Z bogomips : 5600.08 2025-05-07T20:23:12.6251555Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6251805Z clflush size : 64 2025-05-07T20:23:12.6252033Z cache_alignment : 64 2025-05-07T20:23:12.6252333Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6252672Z power management: 2025-05-07T20:23:12.6252811Z 2025-05-07T20:23:12.6252900Z processor : 7 2025-05-07T20:23:12.6253134Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6253388Z cpu family : 23 2025-05-07T20:23:12.6253604Z model : 49 2025-05-07T20:23:12.6253828Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6254084Z stepping : 0 2025-05-07T20:23:12.6254309Z microcode : 0x830107f 2025-05-07T20:23:12.6254543Z cpu MHz : 3270.817 2025-05-07T20:23:12.6254781Z cache size : 512 KB 2025-05-07T20:23:12.6255010Z physical id : 0 2025-05-07T20:23:12.6255232Z siblings : 16 2025-05-07T20:23:12.6255446Z core id : 7 2025-05-07T20:23:12.6255660Z cpu cores : 8 2025-05-07T20:23:12.6255868Z apicid : 
14 2025-05-07T20:23:12.6256092Z initial apicid : 14 2025-05-07T20:23:12.6256320Z fpu : yes 2025-05-07T20:23:12.6256530Z fpu_exception : yes 2025-05-07T20:23:12.6256761Z cpuid level : 13 2025-05-07T20:23:12.6256984Z wp : yes 2025-05-07T20:23:12.6259007Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6261304Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6261816Z bogomips : 5600.08 2025-05-07T20:23:12.6262054Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6262302Z clflush size : 64 2025-05-07T20:23:12.6262536Z cache_alignment : 64 2025-05-07T20:23:12.6262817Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6263161Z power management: 2025-05-07T20:23:12.6263299Z 2025-05-07T20:23:12.6263396Z processor : 8 2025-05-07T20:23:12.6263636Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6263913Z cpu family : 23 2025-05-07T20:23:12.6264156Z model : 49 2025-05-07T20:23:12.6264381Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6264644Z stepping : 0 2025-05-07T20:23:12.6264872Z microcode : 0x830107f 2025-05-07T20:23:12.6265109Z cpu MHz : 3163.691 2025-05-07T20:23:12.6265344Z cache size : 512 KB 2025-05-07T20:23:12.6265578Z physical id : 0 2025-05-07T20:23:12.6265814Z siblings : 16 2025-05-07T20:23:12.6266027Z core id : 0 2025-05-07T20:23:12.6266250Z cpu cores : 8 2025-05-07T20:23:12.6266470Z apicid : 1 2025-05-07T20:23:12.6266680Z initial apicid : 1 2025-05-07T20:23:12.6266917Z fpu : yes 2025-05-07T20:23:12.6267136Z fpu_exception : yes 2025-05-07T20:23:12.6267364Z cpuid level : 13 2025-05-07T20:23:12.6267592Z wp : yes 2025-05-07T20:23:12.6269605Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6272071Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6272580Z bogomips : 5600.08 2025-05-07T20:23:12.6272822Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6273077Z clflush size : 64 2025-05-07T20:23:12.6273309Z cache_alignment : 64 2025-05-07T20:23:12.6273664Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6274009Z power management: 2025-05-07T20:23:12.6274151Z 2025-05-07T20:23:12.6274255Z processor : 9 2025-05-07T20:23:12.6274482Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6274740Z cpu family : 23 2025-05-07T20:23:12.6274963Z model : 49 2025-05-07T20:23:12.6275180Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6275442Z 
stepping : 0 2025-05-07T20:23:12.6275666Z microcode : 0x830107f 2025-05-07T20:23:12.6275906Z cpu MHz : 2976.806 2025-05-07T20:23:12.6276139Z cache size : 512 KB 2025-05-07T20:23:12.6276430Z physical id : 0 2025-05-07T20:23:12.6276663Z siblings : 16 2025-05-07T20:23:12.6276899Z core id : 1 2025-05-07T20:23:12.6277136Z cpu cores : 8 2025-05-07T20:23:12.6277350Z apicid : 3 2025-05-07T20:23:12.6277571Z initial apicid : 3 2025-05-07T20:23:12.6277801Z fpu : yes 2025-05-07T20:23:12.6278010Z fpu_exception : yes 2025-05-07T20:23:12.6278246Z cpuid level : 13 2025-05-07T20:23:12.6278469Z wp : yes 2025-05-07T20:23:12.6280480Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6282777Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6283291Z bogomips : 5600.08 2025-05-07T20:23:12.6283531Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6283787Z clflush size : 64 2025-05-07T20:23:12.6284017Z cache_alignment : 64 2025-05-07T20:23:12.6284313Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6284650Z power management: 2025-05-07T20:23:12.6284792Z 2025-05-07T20:23:12.6284883Z processor : 10 2025-05-07T20:23:12.6285123Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6285393Z cpu family : 23 2025-05-07T20:23:12.6285604Z model : 49 2025-05-07T20:23:12.6285826Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6286083Z stepping : 0 2025-05-07T20:23:12.6286296Z microcode : 0x830107f 2025-05-07T20:23:12.6286541Z cpu MHz : 3114.815 2025-05-07T20:23:12.6286767Z cache size : 512 KB 2025-05-07T20:23:12.6286990Z physical id : 0 2025-05-07T20:23:12.6287212Z siblings : 16 2025-05-07T20:23:12.6287424Z core id : 2 2025-05-07T20:23:12.6287626Z cpu cores : 8 2025-05-07T20:23:12.6287838Z apicid : 5 2025-05-07T20:23:12.6288056Z initial apicid : 5 2025-05-07T20:23:12.6288274Z fpu : yes 2025-05-07T20:23:12.6288485Z fpu_exception : yes 2025-05-07T20:23:12.6288713Z cpuid level : 13 2025-05-07T20:23:12.6288926Z wp : yes 2025-05-07T20:23:12.6290927Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6293298Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6293805Z bogomips : 5600.08 2025-05-07T20:23:12.6294145Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6294394Z clflush size : 64 2025-05-07T20:23:12.6294622Z cache_alignment : 64 2025-05-07T20:23:12.6294909Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:12.6295233Z power management: 2025-05-07T20:23:12.6295376Z 2025-05-07T20:23:12.6295464Z processor : 11 2025-05-07T20:23:12.6295695Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6295945Z cpu family : 23 2025-05-07T20:23:12.6296167Z model : 49 2025-05-07T20:23:12.6296396Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6296646Z stepping : 0 2025-05-07T20:23:12.6296869Z microcode : 0x830107f 2025-05-07T20:23:12.6297110Z cpu MHz : 3207.896 2025-05-07T20:23:12.6297336Z cache size : 512 KB 2025-05-07T20:23:12.6297559Z physical id : 0 2025-05-07T20:23:12.6297782Z siblings : 16 2025-05-07T20:23:12.6297999Z core id : 3 2025-05-07T20:23:12.6298205Z cpu cores : 8 2025-05-07T20:23:12.6298417Z apicid : 7 2025-05-07T20:23:12.6298629Z initial apicid : 7 2025-05-07T20:23:12.6298850Z fpu : yes 2025-05-07T20:23:12.6299061Z fpu_exception : yes 2025-05-07T20:23:12.6299288Z cpuid level : 13 2025-05-07T20:23:12.6299500Z wp : yes 2025-05-07T20:23:12.6301504Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6303803Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6304316Z bogomips : 5600.08 2025-05-07T20:23:12.6304540Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6304797Z clflush size : 64 2025-05-07T20:23:12.6305029Z cache_alignment : 64 2025-05-07T20:23:12.6305308Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6305637Z power management: 2025-05-07T20:23:12.6305780Z 2025-05-07T20:23:12.6305871Z processor : 12 2025-05-07T20:23:12.6306097Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6306345Z cpu family : 23 2025-05-07T20:23:12.6306564Z model : 49 2025-05-07T20:23:12.6306779Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6307034Z stepping : 0 2025-05-07T20:23:12.6307252Z microcode : 0x830107f 2025-05-07T20:23:12.6307488Z cpu MHz : 3272.436 2025-05-07T20:23:12.6307706Z cache size : 512 KB 2025-05-07T20:23:12.6307935Z physical id : 0 2025-05-07T20:23:12.6308153Z siblings : 16 2025-05-07T20:23:12.6308358Z core id : 4 2025-05-07T20:23:12.6308567Z cpu cores : 8 2025-05-07T20:23:12.6308780Z apicid : 9 2025-05-07T20:23:12.6308982Z initial apicid : 9 2025-05-07T20:23:12.6309210Z fpu : yes 2025-05-07T20:23:12.6309424Z fpu_exception : yes 2025-05-07T20:23:12.6309650Z cpuid level : 13 2025-05-07T20:23:12.6309869Z wp : yes 2025-05-07T20:23:12.6311871Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:12.6314343Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6314853Z bogomips : 5600.08 2025-05-07T20:23:12.6315078Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6315325Z clflush size : 64 2025-05-07T20:23:12.6315555Z cache_alignment : 64 2025-05-07T20:23:12.6315926Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6316258Z power management: 2025-05-07T20:23:12.6316397Z 2025-05-07T20:23:12.6316489Z processor : 13 2025-05-07T20:23:12.6316715Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6316966Z cpu family : 23 2025-05-07T20:23:12.6317182Z model : 49 2025-05-07T20:23:12.6317395Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6317655Z stepping : 0 2025-05-07T20:23:12.6317873Z microcode : 0x830107f 2025-05-07T20:23:12.6318110Z cpu MHz : 3213.101 2025-05-07T20:23:12.6318335Z cache size : 512 KB 2025-05-07T20:23:12.6318560Z physical id : 0 2025-05-07T20:23:12.6318772Z siblings : 16 2025-05-07T20:23:12.6318986Z core id : 5 2025-05-07T20:23:12.6319193Z cpu cores : 8 2025-05-07T20:23:12.6319397Z apicid : 11 2025-05-07T20:23:12.6319614Z initial apicid : 11 2025-05-07T20:23:12.6319837Z fpu : yes 2025-05-07T20:23:12.6320042Z fpu_exception : yes 2025-05-07T20:23:12.6320271Z cpuid level : 13 2025-05-07T20:23:12.6320490Z wp : yes 2025-05-07T20:23:12.6322508Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6325076Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6325581Z bogomips : 5600.08 2025-05-07T20:23:12.6325813Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6326061Z clflush size : 64 2025-05-07T20:23:12.6326284Z cache_alignment : 64 2025-05-07T20:23:12.6326568Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6326915Z power management: 2025-05-07T20:23:12.6327052Z 2025-05-07T20:23:12.6327141Z processor : 14 2025-05-07T20:23:12.6327372Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6327622Z cpu family : 23 2025-05-07T20:23:12.6327831Z model : 49 2025-05-07T20:23:12.6328046Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6328296Z stepping : 0 2025-05-07T20:23:12.6328508Z microcode : 0x830107f 2025-05-07T20:23:12.6328745Z cpu MHz : 3276.735 2025-05-07T20:23:12.6328976Z cache size : 512 KB 2025-05-07T20:23:12.6329198Z physical id : 0 2025-05-07T20:23:12.6329418Z siblings : 16 2025-05-07T20:23:12.6329628Z core id : 6 2025-05-07T20:23:12.6329833Z cpu cores : 8 2025-05-07T20:23:12.6330047Z apicid : 13 2025-05-07T20:23:12.6330263Z initial apicid : 13 2025-05-07T20:23:12.6330481Z fpu : yes 2025-05-07T20:23:12.6330690Z fpu_exception : yes 2025-05-07T20:23:12.6330919Z cpuid level : 13 2025-05-07T20:23:12.6331132Z wp : yes 2025-05-07T20:23:12.6333154Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6337320Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6337836Z bogomips : 5600.08 2025-05-07T20:23:12.6338067Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6338318Z clflush size : 64 2025-05-07T20:23:12.6338551Z cache_alignment : 64 2025-05-07T20:23:12.6338838Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6339163Z power management: 2025-05-07T20:23:12.6339304Z 2025-05-07T20:23:12.6339516Z processor : 15 2025-05-07T20:23:12.6339754Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6340019Z cpu family : 23 2025-05-07T20:23:12.6340268Z model : 49 2025-05-07T20:23:12.6340496Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6340747Z stepping : 0 2025-05-07T20:23:12.6340973Z microcode : 0x830107f 2025-05-07T20:23:12.6341216Z cpu MHz : 3253.540 2025-05-07T20:23:12.6341442Z cache size : 512 KB 2025-05-07T20:23:12.6341677Z physical id : 0 2025-05-07T20:23:12.6341914Z siblings : 16 2025-05-07T20:23:12.6342129Z core id : 7 2025-05-07T20:23:12.6342347Z cpu cores : 8 2025-05-07T20:23:12.6342564Z apicid : 15 2025-05-07T20:23:12.6342782Z initial apicid : 15 2025-05-07T20:23:12.6343017Z fpu : yes 2025-05-07T20:23:12.6343243Z fpu_exception : yes 2025-05-07T20:23:12.6343473Z cpuid level : 13 2025-05-07T20:23:12.6343700Z wp : yes 2025-05-07T20:23:12.6345724Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6348016Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6348531Z bogomips : 5600.08 2025-05-07T20:23:12.6348762Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6349013Z clflush size : 64 2025-05-07T20:23:12.6349248Z cache_alignment : 64 2025-05-07T20:23:12.6349537Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6349877Z power management: 2025-05-07T20:23:12.6350017Z 2025-05-07T20:23:12.6350021Z 2025-05-07T20:23:12.6350158Z ################################################################################ 2025-05-07T20:23:12.6350482Z [INFO] Print PCI info ... 2025-05-07T20:23:12.6350780Z + lspci -v 2025-05-07T20:23:12.6350913Z 2025-05-07T20:23:12.6351103Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:12.6351512Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:12.6351858Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:12.6352079Z 2025-05-07T20:23:12.6352291Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:12.6352694Z Physical Slot: 1 2025-05-07T20:23:12.6352952Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.6353164Z 2025-05-07T20:23:12.6353421Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:12.6353957Z Physical Slot: 1 2025-05-07T20:23:12.6354227Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:12.6354461Z 2025-05-07T20:23:12.6354743Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:12.6355200Z Physical Slot: 3 2025-05-07T20:23:12.6355454Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.6355813Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.6356182Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:12.6356420Z 2025-05-07T20:23:12.6356731Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.6357347Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:12.6357647Z Physical Slot: 4 2025-05-07T20:23:12.6357913Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:12.6358314Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.6358681Z Capabilities: 2025-05-07T20:23:12.6358958Z Kernel driver in use: nvme 2025-05-07T20:23:12.6359134Z 2025-05-07T20:23:12.6359444Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.6359945Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.6360302Z Physical Slot: 5 2025-05-07T20:23:12.6360560Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.6360940Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.6361337Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.6361675Z Capabilities: 2025-05-07T20:23:12.6361961Z Kernel driver in use: ena 2025-05-07T20:23:12.6362221Z Kernel modules: ena 2025-05-07T20:23:12.6362367Z 2025-05-07T20:23:12.6362559Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:12.6362964Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:12.6363266Z Physical Slot: 30 2025-05-07T20:23:12.6363541Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:12.6363939Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:12.6364350Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:12.6364743Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:12.6365090Z Capabilities: 2025-05-07T20:23:12.6365369Z Kernel driver in use: nvidia 2025-05-07T20:23:12.6365643Z Kernel modules: nvidia 2025-05-07T20:23:12.6365797Z 2025-05-07T20:23:12.6366113Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.6366656Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:12.6367055Z Physical Slot: 31 2025-05-07T20:23:12.6367412Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.6367848Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.6368534Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:12.6369006Z Capabilities: 2025-05-07T20:23:12.6379698Z Kernel driver in use: nvme 2025-05-07T20:23:12.6379901Z 2025-05-07T20:23:12.6379905Z 2025-05-07T20:23:12.6380042Z ################################################################################ 2025-05-07T20:23:12.6380403Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:12.6380710Z + uname -a 2025-05-07T20:23:12.6380842Z 2025-05-07T20:23:12.6381272Z Linux ip-10-0-65-139.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:12.6381801Z 2025-05-07T20:23:12.6381888Z + uname -m 2025-05-07T20:23:12.6382015Z 2025-05-07T20:23:12.6382108Z x86_64 2025-05-07T20:23:12.6382223Z 2025-05-07T20:23:12.6382317Z + cat /proc/version 2025-05-07T20:23:12.6382470Z 2025-05-07T20:23:12.6383031Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:12.6383682Z 2025-05-07T20:23:12.6383777Z + cat /etc/os-release 2025-05-07T20:23:12.6383933Z 2025-05-07T20:23:12.6384053Z NAME="Amazon Linux" 2025-05-07T20:23:12.6384284Z VERSION="2023" 2025-05-07T20:23:12.6384509Z ID="amzn" 2025-05-07T20:23:12.6384719Z ID_LIKE="fedora" 2025-05-07T20:23:12.6384939Z VERSION_ID="2023" 2025-05-07T20:23:12.6385193Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:12.6385500Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:12.6385805Z ANSI_COLOR="0;33" 2025-05-07T20:23:12.6386079Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:12.6386625Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:12.6387085Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:12.6387523Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:12.6387993Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:12.6388392Z VENDOR_NAME="AWS" 2025-05-07T20:23:12.6388648Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:12.6388961Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:12.6389126Z 2025-05-07T20:23:12.6389353Z ################################################################################ 2025-05-07T20:23:12.6389685Z # Print EC2 Instance Info 2025-05-07T20:23:12.6389940Z # 2025-05-07T20:23:12.6390167Z # [2025-05-07T20:23:12.637Z] + print_ec2_info 2025-05-07T20:23:12.6390505Z ################################################################################ 2025-05-07T20:23:12.6390733Z 2025-05-07T20:23:12.6495034Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:12.6615820Z instance-id: i-09c05d8e2aea2c844 2025-05-07T20:23:12.6731058Z instance-type: g5.4xlarge 2025-05-07T20:23:12.6771202Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:12.6771584Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:12.6780507Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:12.6780887Z env: 2025-05-07T20:23:12.6781130Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:12.6781460Z BUILD_ENV: build_binary 2025-05-07T20:23:12.6781729Z BUILD_TARGET: genai 2025-05-07T20:23:12.6781980Z BUILD_VARIANT: cuda 2025-05-07T20:23:12.6782242Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:12.6782525Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:12.6782859Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:12.6783218Z ##[endgroup] 2025-05-07T20:23:13.0152700Z ################################################################################ 2025-05-07T20:23:13.0153097Z [INFO] Printing general display info ... 2025-05-07T20:23:13.0181524Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:13.1329376Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:13.1338318Z /usr/bin/sudo 2025-05-07T20:23:13.1349043Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:13.1359956Z /usr/bin/yum 2025-05-07T20:23:13.1361604Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:13.1382068Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.5961711Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:13.6679902Z ================================================================================ 2025-05-07T20:23:13.6680431Z WARNING: 2025-05-07T20:23:13.6680778Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:13.6681122Z 2025-05-07T20:23:13.6681256Z Available Versions: 2025-05-07T20:23:13.6681468Z 2025-05-07T20:23:13.6681612Z Version 2023.7.20250331: 2025-05-07T20:23:13.6681958Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:13.6682248Z 2025-05-07T20:23:13.6682390Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:13.6682617Z 2025-05-07T20:23:13.6682710Z Release notes: 2025-05-07T20:23:13.6683138Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:13.6683516Z 2025-05-07T20:23:13.6683612Z Version 2023.7.20250414: 2025-05-07T20:23:13.6683940Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:13.6684204Z 2025-05-07T20:23:13.6684324Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:13.6684540Z 2025-05-07T20:23:13.6684636Z Release notes: 2025-05-07T20:23:13.6685038Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:13.6685415Z 2025-05-07T20:23:13.6685508Z Version 2023.7.20250428: 2025-05-07T20:23:13.6685828Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:13.6686319Z 2025-05-07T20:23:13.6686452Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:13.6686671Z 2025-05-07T20:23:13.6686763Z Release notes: 2025-05-07T20:23:13.6687167Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:13.6687538Z 2025-05-07T20:23:13.6687663Z ================================================================================ 2025-05-07T20:23:13.7855113Z Dependencies resolved. 
2025-05-07T20:23:13.8143802Z ================================================================================ 2025-05-07T20:23:13.8144435Z Package Arch Version Repository Size 2025-05-07T20:23:13.8144945Z ================================================================================ 2025-05-07T20:23:13.8145303Z Upgrading: 2025-05-07T20:23:13.8145732Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:13.8146342Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:13.8146843Z 2025-05-07T20:23:13.8147356Z Transaction Summary 2025-05-07T20:23:13.8147730Z ================================================================================ 2025-05-07T20:23:13.8148182Z Upgrade 2 Packages 2025-05-07T20:23:13.8148396Z 2025-05-07T20:23:13.8148548Z Total download size: 6.9 M 2025-05-07T20:23:13.8148920Z Downloading Packages: 2025-05-07T20:23:13.8581454Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 29 MB/s | 1.2 MB 00:00 2025-05-07T20:23:13.8989398Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 68 MB/s | 5.7 MB 00:00 2025-05-07T20:23:13.8998357Z -------------------------------------------------------------------------------- 2025-05-07T20:23:13.9001593Z Total 81 MB/s | 6.9 MB 00:00 2025-05-07T20:23:13.9004239Z Running transaction check 2025-05-07T20:23:13.9103600Z Transaction check succeeded. 2025-05-07T20:23:13.9104217Z Running transaction test 2025-05-07T20:23:13.9399753Z Transaction test succeeded. 2025-05-07T20:23:13.9403106Z Running transaction 2025-05-07T20:23:14.4963533Z Preparing : 1/1 2025-05-07T20:23:14.6019594Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.6037695Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.6231816Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.6232622Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.6332899Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.6353956Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.7820705Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:14.7821539Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.7822273Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:14.7822837Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:14.9793616Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.9794100Z 2025-05-07T20:23:14.9794224Z Upgraded: 2025-05-07T20:23:14.9794727Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:14.9795343Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:14.9795744Z 2025-05-07T20:23:14.9795830Z Complete! 2025-05-07T20:23:15.0238515Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:15.0263353Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:15.5594146Z Last metadata expiration check: 0:00:11 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:15.5829944Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:15.6233842Z Dependencies resolved.
2025-05-07T20:23:15.6411067Z ================================================================================ 2025-05-07T20:23:15.6412057Z Package Architecture Version Repository Size 2025-05-07T20:23:15.6412921Z ================================================================================ 2025-05-07T20:23:15.6413535Z Installing: 2025-05-07T20:23:15.6414132Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:15.6414673Z 2025-05-07T20:23:15.6414770Z Transaction Summary 2025-05-07T20:23:15.6415079Z ================================================================================ 2025-05-07T20:23:15.6415399Z Install 1 Package 2025-05-07T20:23:15.6415540Z 2025-05-07T20:23:15.6415656Z Total download size: 319 k 2025-05-07T20:23:15.6415917Z Installed size: 837 k 2025-05-07T20:23:15.6416166Z Downloading Packages: 2025-05-07T20:23:15.7136741Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 7.6 MB/s | 319 kB 00:00 2025-05-07T20:23:15.7142539Z -------------------------------------------------------------------------------- 2025-05-07T20:23:15.7145005Z Total 4.3 MB/s | 319 kB 00:00 2025-05-07T20:23:15.7299741Z Running transaction check 2025-05-07T20:23:15.7355745Z Transaction check succeeded. 2025-05-07T20:23:15.7356450Z Running transaction test 2025-05-07T20:23:15.7816597Z Transaction test succeeded. 2025-05-07T20:23:15.7819917Z Running transaction 2025-05-07T20:23:15.8856089Z Preparing : 1/1 2025-05-07T20:23:15.9361779Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:16.1405615Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:16.3092999Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:16.3093349Z 2025-05-07T20:23:16.3093440Z Installed: 2025-05-07T20:23:16.3093767Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:16.3094067Z 2025-05-07T20:23:16.3094158Z Complete! 2025-05-07T20:23:16.3552244Z + hostname 2025-05-07T20:23:16.3552454Z 2025-05-07T20:23:16.3566991Z ip-10-0-65-139.ec2.internal 2025-05-07T20:23:16.3568655Z 2025-05-07T20:23:16.3569141Z + sudo lshw -C display 2025-05-07T20:23:16.3569374Z 2025-05-07T20:23:16.7835033Z *-display:0 UNCLAIMED 2025-05-07T20:23:16.7835529Z description: VGA compatible controller 2025-05-07T20:23:16.7835984Z product: Amazon.com, Inc. 2025-05-07T20:23:16.7836390Z vendor: Amazon.com, Inc.
2025-05-07T20:23:16.7836748Z physical id: 3 2025-05-07T20:23:16.7837077Z bus info: pci@0000:00:03.0 2025-05-07T20:23:16.7837423Z version: 00 2025-05-07T20:23:16.7837718Z width: 32 bits 2025-05-07T20:23:16.7837978Z clock: 33MHz 2025-05-07T20:23:16.7838241Z capabilities: vga_controller bus_master 2025-05-07T20:23:16.7838574Z configuration: latency=0 2025-05-07T20:23:16.7838913Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:16.7839253Z *-display:1 2025-05-07T20:23:16.7839517Z description: 3D controller 2025-05-07T20:23:16.7839815Z product: GA102GL [A10G] 2025-05-07T20:23:16.7840086Z vendor: NVIDIA Corporation 2025-05-07T20:23:16.7840367Z physical id: 1e 2025-05-07T20:23:16.7840615Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:16.7840876Z version: a1 2025-05-07T20:23:16.7841102Z width: 64 bits 2025-05-07T20:23:16.7841335Z clock: 33MHz 2025-05-07T20:23:16.7841639Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:16.7842022Z configuration: driver=nvidia latency=0 2025-05-07T20:23:16.7842665Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:16.7875286Z 2025-05-07T20:23:16.7875588Z ################################################################################ 2025-05-07T20:23:16.7876074Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:16.8004532Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:16.8168986Z Wed May 7 20:23:16 2025 2025-05-07T20:23:16.8169575Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.8170118Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:16.8170625Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.8171134Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:16.8171676Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:16.8172115Z | | | MIG M. | 2025-05-07T20:23:16.8172465Z |=========================================+========================+======================| 2025-05-07T20:23:16.8246586Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:16.8247454Z | 0% 30C P0 58W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:16.8247850Z | | | N/A | 2025-05-07T20:23:16.8248257Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.8248662Z 2025-05-07T20:23:16.8249066Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.8249501Z | Processes: | 2025-05-07T20:23:16.8249951Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:16.8250374Z | ID ID Usage | 2025-05-07T20:23:16.8250747Z |=========================================================================================| 2025-05-07T20:23:16.8251635Z | No running processes found | 2025-05-07T20:23:16.8252111Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.9658731Z ################################################################################ 2025-05-07T20:23:16.9659121Z [INFO] Printing AMD GPU info ... 
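On this CUDA runner, the AMD probe below only reports whether the ROCm tooling exists on PATH. A sketch of an equivalent check, assuming the same [CHECK] message format as the log output that follows:

    # Probe for ROCm tools; print their output if present, otherwise note
    # their absence. Both checks fail on this NVIDIA A10G instance.
    for tool in rocminfo rocm-smi; do
      if command -v "${tool}" >/dev/null 2>&1; then
        "${tool}"
      else
        echo "[CHECK] ${tool} not found"
      fi
    done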
2025-05-07T20:23:16.9658731Z ################################################################################
2025-05-07T20:23:16.9659121Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:16.9805223Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:16.9806026Z [CHECK] rocminfo not found
2025-05-07T20:23:16.9814497Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:16.9815581Z [CHECK] rocm-smi not found
2025-05-07T20:23:16.9861643Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:16.9862111Z . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:16.9874285Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:16.9874655Z env:
2025-05-07T20:23:16.9874894Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:16.9875221Z   BUILD_ENV: build_binary
2025-05-07T20:23:16.9875535Z   BUILD_TARGET: genai
2025-05-07T20:23:16.9875775Z   BUILD_VARIANT: cuda
2025-05-07T20:23:16.9876027Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:16.9876303Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:16.9876619Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:16.9876970Z ##[endgroup]
2025-05-07T20:23:17.3278071Z ################################################################################
2025-05-07T20:23:17.3278443Z # Setup Miniconda
2025-05-07T20:23:17.3278671Z #
2025-05-07T20:23:17.3295569Z # [2025-05-07T20:23:17.329Z] + setup_miniconda /home/ec2-user/miniconda
2025-05-07T20:23:17.3295989Z ################################################################################
2025-05-07T20:23:17.3311017Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:17.4207164Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:17.4207687Z + mkdir -p /home/ec2-user/miniconda
2025-05-07T20:23:17.4225805Z [SETUP] Downloading the Miniconda installer ...
2025-05-07T20:23:17.4245936Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
2025-05-07T20:23:18.1628469Z [SETUP] Installing Miniconda ...
2025-05-07T20:23:18.1629031Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u
2025-05-07T20:23:18.1775309Z PREFIX=/home/ec2-user/miniconda
2025-05-07T20:23:18.6266407Z Unpacking payload ...
2025-05-07T20:23:19.1480922Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:22.0667366Z Installing base environment...
2025-05-07T20:23:23.1517892Z Preparing transaction: ...working... done
2025-05-07T20:23:26.1580804Z Executing transaction: ...working... done
2025-05-07T20:23:26.9107538Z installation finished.
2025-05-07T20:23:26.9114441Z + rm -f miniconda.sh
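The [EXEC] [ATTEMPT 0/3] prefix on the wget and conda commands comes from a retry wrapper in the prelude that re-runs flaky network operations. A minimal sketch of that pattern; the function name and backoff are assumptions, only the attempt budget is taken from the log:

    # Sketch of the retry wrapper behind the "[EXEC] [ATTEMPT i/3]" lines.
    exec_with_retries () {
      local max_attempts=3
      for attempt in $(seq 0 "${max_attempts}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
        if "$@"; then
          return 0
        fi
        sleep $((2 ** attempt))   # simple exponential backoff between attempts
      done
      echo "[EXEC] Command failed after all retry attempts: $*" >&2
      return 1
    }

    # Usage, mirroring the network check in the log:
    # exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null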
2025-05-07T20:23:26.9426478Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:23:26.9427346Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:27.3138656Z no change     /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:27.3139154Z no change     /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:27.3139670Z no change     /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:27.3140193Z no change     /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:27.3140701Z no change     /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:27.3141243Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:27.3141707Z no change     /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:27.3142170Z no change     /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:27.3142647Z no change     /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:27.3143439Z no change     /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:27.3143991Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:27.3144381Z modified      /home/ec2-user/.bashrc
2025-05-07T20:23:27.3144799Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.3812077Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:28.2224371Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:28.2247810Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:41.5595836Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:43.1369681Z Solving environment: done
2025-05-07T20:23:43.2342441Z ## Package Plan ##
2025-05-07T20:23:43.2342889Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.2343357Z   added / updated specs:
2025-05-07T20:23:43.2343740Z     - conda-libmamba-solver
2025-05-07T20:23:43.2344134Z     - libarchive
2025-05-07T20:23:43.2344379Z     - libmamba
2025-05-07T20:23:43.2344601Z     - libmambapy
2025-05-07T20:23:43.2344887Z The following packages will be downloaded:
2025-05-07T20:23:43.2345237Z     package                     |            build
2025-05-07T20:23:43.2345571Z     ----------------------------|-----------------
2025-05-07T20:23:43.2346000Z     ca-certificates-2025.4.26   |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:23:43.2346497Z     certifi-2025.4.26           |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:23:43.2346951Z     conda-25.3.1                |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:23:43.2347448Z     conda-libmamba-solver-25.4.0|     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:23:43.2347911Z     ------------------------------------------------------------
2025-05-07T20:23:43.2348284Z                                            Total:         1.4 MB
2025-05-07T20:23:43.2348631Z The following packages will be UPDATED:
2025-05-07T20:23:43.2354383Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.2355200Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.2355834Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.2356493Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.2357444Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.2358145Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:43.5558022Z Preparing transaction: done
2025-05-07T20:23:43.6563577Z Verifying transaction: done
2025-05-07T20:23:45.0583863Z Executing transaction: done
2025-05-07T20:23:46.7856860Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:23:46.7882524Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:47.7087179Z Channels:
2025-05-07T20:23:47.7087524Z  - defaults
2025-05-07T20:23:47.7087845Z Platform: linux-64
2025-05-07T20:23:48.9354789Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.5519118Z Solving environment: done
2025-05-07T20:23:49.6999753Z ## Package Plan ##
2025-05-07T20:23:49.7000180Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:49.7000587Z   added / updated specs:
2025-05-07T20:23:49.7000967Z     - conda
2025-05-07T20:23:49.7001358Z The following packages will be downloaded:
2025-05-07T20:23:49.7001908Z     package                    |            build
2025-05-07T20:23:49.7002438Z     ---------------------------|-----------------
2025-05-07T20:23:49.7002897Z     pip-25.1                   |     pyhc872135_2         1.3 MB
2025-05-07T20:23:49.7003957Z     tzdata-2025b               |       h04d1e81_0         116 KB
2025-05-07T20:23:49.7004454Z     ------------------------------------------------------------
2025-05-07T20:23:49.7004810Z                                            Total:         1.4 MB
2025-05-07T20:23:49.7005189Z The following packages will be UPDATED:
2025-05-07T20:23:49.7005732Z   pip                pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:49.7006269Z   tzdata                               2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:49.7006689Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:50.1370558Z Preparing transaction: done
2025-05-07T20:23:50.2378371Z Verifying transaction: done
2025-05-07T20:23:52.7414781Z Executing transaction: done
2025-05-07T20:23:53.3897432Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:53.3901116Z + conda clean --packages --tarball -y
2025-05-07T20:23:54.4375526Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:54.4375926Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.5035298Z + conda clean --all -y
2025-05-07T20:23:55.0749155Z There are no unused tarball(s) to remove.
2025-05-07T20:23:55.0749820Z Will remove 1 index cache(s).
2025-05-07T20:23:55.0750386Z There are no unused package(s) to remove.
2025-05-07T20:23:55.0751004Z There are no tempfile(s) to remove.
2025-05-07T20:23:55.0751585Z There are no logfile(s) to remove.
2025-05-07T20:23:55.1410676Z + conda info
2025-05-07T20:23:55.9310450Z      active environment : base
2025-05-07T20:23:55.9310966Z     active env location : /home/ec2-user/miniconda
2025-05-07T20:23:55.9311464Z             shell level : 1
2025-05-07T20:23:55.9311767Z        user config file : /home/ec2-user/.condarc
2025-05-07T20:23:55.9312181Z  populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:23:55.9312594Z           conda version : 25.3.1
2025-05-07T20:23:55.9312899Z     conda-build version : not installed
2025-05-07T20:23:55.9313220Z          python version : 3.13.2.final.0
2025-05-07T20:23:55.9313683Z                  solver : libmamba (default)
2025-05-07T20:23:55.9314018Z        virtual packages : __archspec=1=zen2
2025-05-07T20:23:55.9314342Z                           __conda=25.3.1=0
2025-05-07T20:23:55.9314637Z                           __cuda=12.8=0
2025-05-07T20:23:55.9314939Z                           __glibc=2.34=0
2025-05-07T20:23:55.9315241Z                           __linux=6.1.130=0
2025-05-07T20:23:55.9315534Z                           __unix=0=0
2025-05-07T20:23:55.9315897Z        base environment : /home/ec2-user/miniconda  (writable)
2025-05-07T20:23:55.9316701Z       conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:23:55.9317077Z   conda av metadata url : None
2025-05-07T20:23:55.9317478Z            channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:23:55.9317945Z                           https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:23:55.9318360Z                           https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:23:55.9318762Z                           https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:23:55.9319160Z           package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:23:55.9319527Z                           /home/ec2-user/.conda/pkgs
2025-05-07T20:23:55.9319891Z        envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:23:55.9320259Z                           /home/ec2-user/.conda/envs
2025-05-07T20:23:55.9320592Z                platform : linux-64
2025-05-07T20:23:55.9321491Z              user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:23:55.9322371Z                 UID:GID : 1000:1000
2025-05-07T20:23:55.9322821Z              netrc file : None
2025-05-07T20:23:55.9323104Z            offline mode : False
2025-05-07T20:23:56.0009061Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:23:56.0009883Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_34024b15-9fe3-4d03-9b77-057f24e11cb3 ...
2025-05-07T20:23:56.0010739Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
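The "Saving Miniconda variables" step persists PATH additions through the runner's file commands (the add_path_* file above backs GITHUB_PATH). A minimal sketch of that mechanism, assuming the standard GITHUB_PATH/GITHUB_ENV interface rather than the script's exact helper:

    # Sketch: make conda visible to all later workflow steps.
    # GITHUB_PATH/GITHUB_ENV are the documented Actions file commands;
    # the helper name is an assumption.
    export_miniconda_vars () {
      local prefix="$1"                          # e.g. /home/ec2-user/miniconda
      echo "${prefix}/bin" >> "${GITHUB_PATH}"   # prepended to PATH in later steps
      echo "CONDA=${prefix}" >> "${GITHUB_ENV}"  # plain env var for later steps
    }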
2025-05-07T20:23:56.0084754Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.10
2025-05-07T20:23:56.0085303Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.10
2025-05-07T20:23:56.0104175Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:56.0104562Z env:
2025-05-07T20:23:56.0104807Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:56.0105142Z   BUILD_ENV: build_binary
2025-05-07T20:23:56.0105417Z   BUILD_TARGET: genai
2025-05-07T20:23:56.0105687Z   BUILD_VARIANT: cuda
2025-05-07T20:23:56.0105945Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:56.0106230Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:56.0106565Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:56.0106973Z ##[endgroup]
2025-05-07T20:23:56.3664563Z ################################################################################
2025-05-07T20:23:56.3664948Z # Create Conda Environment
2025-05-07T20:23:56.3665217Z #
2025-05-07T20:23:56.3680173Z # [2025-05-07T20:23:56.367Z] + create_conda_environment build_binary 3.10
2025-05-07T20:23:56.3680612Z ################################################################################
2025-05-07T20:23:56.3695433Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:56.4605881Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:56.4606436Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:23:56.4606898Z + conda info --envs
2025-05-07T20:23:57.2116632Z # conda environments:
2025-05-07T20:23:57.2116938Z #
2025-05-07T20:23:57.2117188Z base                 /home/ec2-user/miniconda
2025-05-07T20:23:57.2785134Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:23:58.9228528Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:23:58.9249711Z [SETUP] Creating new Conda environment (Python 3.10) ...
2025-05-07T20:23:58.9271806Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.10
2025-05-07T20:23:59.6948279Z Channels:
2025-05-07T20:23:59.6948626Z  - defaults
2025-05-07T20:23:59.6948959Z Platform: linux-64
2025-05-07T20:24:01.3521725Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:01.4527289Z Solving environment: done
2025-05-07T20:24:01.4836298Z ## Package Plan ##
2025-05-07T20:24:01.4836853Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:01.4837473Z   added / updated specs:
2025-05-07T20:24:01.4837845Z     - python=3.10
2025-05-07T20:24:01.4838232Z The following packages will be downloaded:
2025-05-07T20:24:01.4838637Z     package                    |            build
2025-05-07T20:24:01.4839007Z     ---------------------------|-----------------
2025-05-07T20:24:01.4839459Z     _libgcc_mutex-0.1          |             main           3 KB
2025-05-07T20:24:01.4839936Z     _openmp_mutex-5.1          |            1_gnu          21 KB
2025-05-07T20:24:01.4840402Z     ca-certificates-2025.2.25  |       h06a4308_0         129 KB
2025-05-07T20:24:01.4840868Z     python-3.10.16             |       he870216_1        26.9 MB
2025-05-07T20:24:01.4841663Z     setuptools-78.1.1          |  py310h06a4308_0         1.7 MB
2025-05-07T20:24:01.4842113Z     wheel-0.45.1               |  py310h06a4308_0         115 KB
2025-05-07T20:24:01.4842520Z     ------------------------------------------------------------
2025-05-07T20:24:01.4842903Z                                            Total:        28.8 MB
2025-05-07T20:24:01.4843293Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:01.4843989Z   _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:01.4844505Z   _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:01.4845162Z   bzip2              pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:01.4846309Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:01.4847068Z   ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:01.4847592Z   libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:01.4848230Z   libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.4848733Z   libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:01.4849275Z   libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.4849792Z   libuuid            pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:01.4850275Z   ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:01.4850752Z   openssl            pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:01.4851204Z   pip                pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:01.4851658Z   python             pkgs/main/linux-64::python-3.10.16-he870216_1
2025-05-07T20:24:01.4852144Z   readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:01.4852666Z   setuptools         pkgs/main/linux-64::setuptools-78.1.1-py310h06a4308_0
2025-05-07T20:24:01.4853195Z   sqlite             pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:01.4853632Z   tk                 pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:01.4854067Z   tzdata             pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:01.4854531Z   wheel              pkgs/main/linux-64::wheel-0.45.1-py310h06a4308_0
2025-05-07T20:24:01.4854977Z   xz                 pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:01.4855403Z   zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:01.4855850Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:02.8620412Z Preparing transaction: done
2025-05-07T20:24:04.0284379Z Verifying transaction: done
2025-05-07T20:24:06.3463290Z Executing transaction: done
2025-05-07T20:24:06.3973787Z #
2025-05-07T20:24:06.3974493Z # To activate this environment, use
2025-05-07T20:24:06.3975831Z #
2025-05-07T20:24:06.3976376Z #     $ conda activate build_binary
2025-05-07T20:24:06.3976939Z #
2025-05-07T20:24:06.3977408Z # To deactivate an active environment, use
2025-05-07T20:24:06.3978033Z #
2025-05-07T20:24:06.3978439Z #     $ conda deactivate
2025-05-07T20:24:06.5055843Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:06.5077188Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:09.6131802Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (25.1)
2025-05-07T20:24:09.6132517Z Collecting pip
2025-05-07T20:24:09.6132881Z   Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:09.6133361Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:09.6134307Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 92.8 MB/s eta 0:00:00
2025-05-07T20:24:09.6134714Z Installing collected packages: pip
2025-05-07T20:24:09.6135080Z   Attempting uninstall: pip
2025-05-07T20:24:09.6135408Z     Found existing installation: pip 25.1
2025-05-07T20:24:09.6135758Z     Uninstalling pip-25.1:
2025-05-07T20:24:09.6136080Z       Successfully uninstalled pip-25.1
2025-05-07T20:24:09.6136438Z Successfully installed pip-25.1.1
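create_conda_environment tears down any stale prefix, creates a fresh env pinned to the requested Python, and then drives tools inside it with conda run so no interactive activation is needed. A minimal sketch of that flow under those assumptions (the real logic lives in .github/scripts/setup_env.bash):

    # Sketch of the create-environment flow seen above; the function shape
    # is an assumption, the individual commands mirror the log.
    create_conda_environment () {
      local env_name="$1" python_version="$2"
      # Start from a clean slate so retries never reuse a half-built env
      rm -rf "$(conda info --base)/envs/${env_name}"
      conda create -y -n "${env_name}" "python=${python_version}"
      # conda run executes inside the env without needing `conda activate`
      conda run -n "${env_name}" pip install --upgrade pip
    }

    # Usage matching the log: create_conda_environment build_binary 3.10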
2025-05-07T20:24:09.6800523Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:09.6826737Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:10.5948059Z Channels:
2025-05-07T20:24:10.5948515Z  - conda-forge
2025-05-07T20:24:10.5948972Z Platform: linux-64
2025-05-07T20:24:21.4389871Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:23.0607194Z Solving environment: done
2025-05-07T20:24:23.1239804Z ## Package Plan ##
2025-05-07T20:24:23.1240391Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:23.1240984Z   added / updated specs:
2025-05-07T20:24:23.1241279Z     - pyopenssl[version='>22.1.0']
2025-05-07T20:24:23.1241628Z The following packages will be downloaded:
2025-05-07T20:24:23.1242001Z     package                    |            build
2025-05-07T20:24:23.1242369Z     ---------------------------|-----------------
2025-05-07T20:24:23.1242782Z     cffi-1.17.1                |  py310h8deb56e_0         238 KB  conda-forge
2025-05-07T20:24:23.1243267Z     cryptography-44.0.3        |  py310h6c63255_0         1.5 MB  conda-forge
2025-05-07T20:24:23.1243803Z     libgcc-15.1.0              |       h767d61c_2         810 KB  conda-forge
2025-05-07T20:24:23.1244315Z     libgcc-ng-15.1.0           |       h69a702a_2          34 KB  conda-forge
2025-05-07T20:24:23.1244766Z     libgomp-15.1.0             |       h767d61c_2         442 KB  conda-forge
2025-05-07T20:24:23.1245227Z     openssl-3.5.0              |       h7b32b05_1         3.0 MB  conda-forge
2025-05-07T20:24:23.1245697Z     pycparser-2.22             |     pyh29332c3_1         108 KB  conda-forge
2025-05-07T20:24:23.1246182Z     pyopenssl-25.0.0           |     pyhd8ed1ab_0         120 KB  conda-forge
2025-05-07T20:24:23.1246645Z     python_abi-3.10            |          2_cp310           4 KB  conda-forge
2025-05-07T20:24:23.1247151Z     typing-extensions-4.13.2   |       h0e9735f_0          88 KB  conda-forge
2025-05-07T20:24:23.1247682Z     typing_extensions-4.13.2   |     pyh29332c3_0          51 KB  conda-forge
2025-05-07T20:24:23.1248139Z     ------------------------------------------------------------
2025-05-07T20:24:23.1248515Z                                            Total:         6.3 MB
2025-05-07T20:24:23.1248930Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:23.1249812Z   cffi               conda-forge/linux-64::cffi-1.17.1-py310h8deb56e_0
2025-05-07T20:24:23.1250358Z   cryptography       conda-forge/linux-64::cryptography-44.0.3-py310h6c63255_0
2025-05-07T20:24:23.1250897Z   libgcc             conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:23.1251383Z   pycparser          conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:23.1251891Z   pyopenssl          conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:23.1252548Z   python_abi         conda-forge/linux-64::python_abi-3.10-2_cp310
2025-05-07T20:24:23.1253409Z   typing-extensions  conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:23.1254272Z   typing_extensions  conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:23.1254998Z The following packages will be UPDATED:
2025-05-07T20:24:23.1255971Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:23.1257257Z   libgcc-ng          pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:23.1258343Z   libgomp            pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:23.1259257Z   openssl            pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:23.1259965Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:23.7644781Z Preparing transaction: done
2025-05-07T20:24:23.8650197Z Verifying transaction: done
2025-05-07T20:24:25.3676448Z Executing transaction: done
2025-05-07T20:24:25.5473833Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:27.3006181Z [CHECK] Python (sub-)package 'OpenSSL' found ...
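The "[SETUP] Testing pyOpenSSL import" check verifies the package is actually importable inside the environment before the build proceeds. A minimal sketch of such a check; the helper name is an assumption, while the conda run pattern matches commands visible elsewhere in the log:

    # Sketch: verify a Python (sub-)package is importable inside the env.
    test_python_import () {
      local env_name="$1" package="$2"
      if conda run -n "${env_name}" python -c "import ${package}"; then
        echo "[CHECK] Python (sub-)package '${package}' found ..."
      else
        echo "[CHECK] Python (sub-)package '${package}' is missing" >&2
        return 1
      fi
    }

    # Usage matching the log: test_python_import build_binary OpenSSL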
2025-05-07T20:24:27.3019037Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:27.3042614Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:28.1645949Z Channels:
2025-05-07T20:24:28.1646255Z  - conda-forge
2025-05-07T20:24:28.1646505Z Platform: linux-64
2025-05-07T20:24:31.4977236Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:31.8769293Z Solving environment: done
2025-05-07T20:24:31.9402405Z ## Package Plan ##
2025-05-07T20:24:31.9402849Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:31.9403307Z   added / updated specs:
2025-05-07T20:24:31.9403587Z     - libxcrypt
2025-05-07T20:24:31.9403871Z The following packages will be downloaded:
2025-05-07T20:24:31.9404242Z     package                    |            build
2025-05-07T20:24:31.9404599Z     ---------------------------|-----------------
2025-05-07T20:24:31.9405030Z     libxcrypt-4.4.36           |       hd590300_1          98 KB  conda-forge
2025-05-07T20:24:31.9405468Z     ------------------------------------------------------------
2025-05-07T20:24:31.9405845Z                                            Total:          98 KB
2025-05-07T20:24:31.9406220Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:31.9406713Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:31.9407189Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:32.2269890Z Preparing transaction: done
2025-05-07T20:24:32.3273342Z Verifying transaction: done
2025-05-07T20:24:32.4280026Z Executing transaction: done
2025-05-07T20:24:35.9563134Z [SETUP] Copying over ...
2025-05-07T20:24:35.9563893Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.10/crypt.h
2025-05-07T20:24:37.5974944Z [SETUP] Installed Python version: Python 3.10.16
2025-05-07T20:24:37.5975426Z [SETUP] Successfully created Conda environment: build_binary
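The crypt.h copy above exists because Python 3.10's headers still include <crypt.h>, which newer glibc-based distros no longer ship; libxcrypt provides the header, and copying it next to the Python headers lets later native builds compile. A sketch of the workaround, assuming the hard-coded prefix shown in the log:

    # Sketch: expose libxcrypt's crypt.h to code that includes Python headers.
    # The prefix is taken from the log; resolving it dynamically is left out.
    prefix=/home/ec2-user/miniconda/envs/build_binary
    cp "${prefix}/include/crypt.h" "${prefix}/include/python3.10/crypt.h"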
2025-05-07T20:24:37.6010446Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:37.6010951Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:37.6026195Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:37.6026751Z env:
2025-05-07T20:24:37.6027006Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:37.6027338Z   BUILD_ENV: build_binary
2025-05-07T20:24:37.6027614Z   BUILD_TARGET: genai
2025-05-07T20:24:37.6027873Z   BUILD_VARIANT: cuda
2025-05-07T20:24:37.6028131Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:24:37.6028419Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:37.6028754Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:37.6029121Z ##[endgroup]
2025-05-07T20:24:37.9509511Z ################################################################################
2025-05-07T20:24:37.9510086Z # Install C/C++ Compilers
2025-05-07T20:24:37.9510495Z #
2025-05-07T20:24:37.9525737Z # [2025-05-07T20:24:37.952Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:37.9526406Z ################################################################################
2025-05-07T20:24:37.9541350Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:38.0461497Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:38.0470617Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:24:38.0494156Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:38.9642019Z Channels:
2025-05-07T20:24:38.9642285Z  - conda-forge
2025-05-07T20:24:38.9642542Z Platform: linux-64
2025-05-07T20:24:42.3619254Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:42.7289500Z Solving environment: done
2025-05-07T20:24:42.7910852Z ## Package Plan ##
2025-05-07T20:24:42.7911382Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:42.7911806Z   added / updated specs:
2025-05-07T20:24:42.7912084Z     - sysroot_linux-64=2.17
2025-05-07T20:24:42.7912411Z The following packages will be downloaded:
2025-05-07T20:24:42.7912765Z     package                       |            build
2025-05-07T20:24:42.7913097Z     ------------------------------|-----------------
2025-05-07T20:24:42.7913624Z     kernel-headers_linux-64-3.10.0|      he073ed8_18         921 KB  conda-forge
2025-05-07T20:24:42.7914204Z     sysroot_linux-64-2.17         |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:24:42.7914781Z     ------------------------------------------------------------
2025-05-07T20:24:42.7915254Z                                            Total:        15.4 MB
2025-05-07T20:24:42.7915709Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:42.7916264Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:42.7917054Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:42.7917557Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:43.9204885Z Preparing transaction: done
2025-05-07T20:24:44.1210263Z Verifying transaction: done
2025-05-07T20:24:44.3287928Z Executing transaction: done
2025-05-07T20:24:44.4809271Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:44.4809758Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:46.2391983Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
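Pinning sysroot_linux-64=2.17 makes the conda-forge toolchain compile and link against glibc 2.17 symbols, which keeps the resulting binaries loadable on older distributions. One way a later step could spot-check that property; this check is an assumption, not a command from the log, and the .so path is a placeholder:

    # Sketch: report the highest GLIBC symbol version an artifact references.
    # Anything above 2.17 would defeat the portability goal of the sysroot pin.
    objdump -T path/to/built_library.so | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -1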
2025-05-07T20:24:46.2405600Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:24:46.2429969Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:24:47.1980447Z Channels:
2025-05-07T20:24:47.1980808Z  - conda-forge
2025-05-07T20:24:47.1981152Z Platform: linux-64
2025-05-07T20:24:50.6202635Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:51.5819545Z Solving environment: done
2025-05-07T20:24:51.6463378Z ## Package Plan ##
2025-05-07T20:24:51.6463766Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:51.6464345Z   added / updated specs:
2025-05-07T20:24:51.6464656Z     - gxx_linux-64=11.4.0
2025-05-07T20:24:51.6464971Z The following packages will be downloaded:
2025-05-07T20:24:51.6465348Z     package                        |            build
2025-05-07T20:24:51.6465694Z     -------------------------------|-----------------
2025-05-07T20:24:51.6466119Z     binutils_impl_linux-64-2.40    |       ha1999f0_7         6.0 MB  conda-forge
2025-05-07T20:24:51.6466657Z     binutils_linux-64-2.40         |       hb3c18ed_4          28 KB  conda-forge
2025-05-07T20:24:51.6467141Z     gcc_impl_linux-64-11.4.0       |      h00c12a0_13        53.0 MB  conda-forge
2025-05-07T20:24:51.6467609Z     gcc_linux-64-11.4.0            |       ha077dfb_4          31 KB  conda-forge
2025-05-07T20:24:51.6468076Z     gxx_impl_linux-64-11.4.0       |      h634f3ee_13        11.2 MB  conda-forge
2025-05-07T20:24:51.6468538Z     gxx_linux-64-11.4.0            |       h35bfe5d_4          29 KB  conda-forge
2025-05-07T20:24:51.6468986Z     ld_impl_linux-64-2.40          |       hf3520f5_7         691 KB  conda-forge
2025-05-07T20:24:51.6469484Z     libgcc-devel_linux-64-11.4.0   |     h8f596e0_113         2.3 MB  conda-forge
2025-05-07T20:24:51.6469983Z     libsanitizer-11.4.0            |      h5763a12_13         3.5 MB  conda-forge
2025-05-07T20:24:51.6470439Z     libstdcxx-15.1.0               |       h8f9b012_2         3.7 MB  conda-forge
2025-05-07T20:24:51.6470937Z     libstdcxx-devel_linux-64-11.4.0|     h8f596e0_113        11.1 MB  conda-forge
2025-05-07T20:24:51.6471440Z     libstdcxx-ng-15.1.0            |       h4852527_2          34 KB  conda-forge
2025-05-07T20:24:51.6471863Z     ------------------------------------------------------------
2025-05-07T20:24:51.6472217Z                                            Total:        91.6 MB
2025-05-07T20:24:51.6472581Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:51.6473092Z   binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:24:51.6473796Z   binutils_linux-64  conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:24:51.6474719Z   gcc_impl_linux-64  conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:24:51.6475264Z   gcc_linux-64       conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:24:51.6475795Z   gxx_impl_linux-64  conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:24:51.6476475Z   gxx_linux-64       conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:24:51.6477036Z   libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:51.6477623Z   libsanitizer       conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:24:51.6478144Z   libstdcxx          conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:24:51.6478708Z   libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:51.6479210Z The following packages will be UPDATED:
2025-05-07T20:24:51.6479762Z   ld_impl_linux-64   pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:24:51.6480511Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:24:51.6481103Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:54.2852396Z Preparing transaction: done
2025-05-07T20:24:54.5865185Z Verifying transaction: done
2025-05-07T20:24:54.7875770Z Executing transaction: done
2025-05-07T20:24:54.9583037Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:59.0368406Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:59.0400569Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:59.0430336Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:59.0460282Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:00.9522511Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:01.0164479Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:03.0315009Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:03.0983746Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:05.1130971Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:05.1770631Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:07.0851163Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:07.1470430Z [CHECK] Binary g++ found in PATH
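Note: a quick way to double-check the symlinks set above is to resolve each wrapper and ask it for its version. This is a minimal sketch, not part of the job's setup scripts; it assumes the build_binary env and paths shown in the log:

    # Each entry point should resolve into the conda env's bin/ directory
    # and report gcc/g++ 11.4.0 from conda-forge.
    for tool in cc gcc c++ g++; do
      readlink -f "/home/ec2-user/miniconda/envs/build_binary/bin/${tool}"
      "/home/ec2-user/miniconda/envs/build_binary/bin/${tool}" --version | head -n 1
    done

2025-05-07T20:25:07.1475574Z [INFO] Printing out all preprocessor defines in the C compiler ...
Note: the dump that follows is produced by running the preprocessor on empty input; with -dM, the -E run prints every predefined macro rather than preprocessed source. A standalone reproduction, assuming the same env name, could be:

    # Dump all predefined macros of the C compiler in the build env, then
    # pull out the version macros as a toolchain sanity check (GCC 11.4 here).
    conda run -n build_binary cc -dM -E - < /dev/null | sort
    conda run -n build_binary cc -dM -E - < /dev/null | grep -E '__GNUC__|__GNUC_MINOR__|__VERSION__'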
2025-05-07T20:25:07.1476901Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:07.1477795Z 2025-05-07T20:25:09.0805359Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:09.0806686Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:09.0807221Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:09.0807655Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:09.0808667Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:09.0809620Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:09.0810151Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:09.0810730Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:09.0811118Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:09.0811810Z #define __CHAR_BIT__ 8 2025-05-07T20:25:09.0813527Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:09.0814126Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:09.0814665Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:09.0815190Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:09.0815746Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:09.0816174Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0816596Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:09.0817138Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:09.0817570Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:09.0818016Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:09.0818675Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:09.0819274Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:09.0819727Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:09.0820244Z #define __GCC_IEC_559 2 2025-05-07T20:25:09.0820630Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:09.0820983Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:09.0821484Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:09.0821911Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:09.0822322Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0822897Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:09.0823299Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:09.0824048Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:09.0824471Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:09.0824885Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:09.0825383Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:09.0825761Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:09.0826149Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:09.0826596Z #define __INT8_C(c) c 2025-05-07T20:25:09.0826938Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:09.0827368Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0827897Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:09.0828392Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:09.0828866Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:09.0829332Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:09.0829744Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0830143Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:09.0830592Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:09.0831152Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:09.0831701Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:09.0832158Z #define __linux 1 2025-05-07T20:25:09.0832529Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:09.0832969Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:09.0833372Z #define __unix 1 2025-05-07T20:25:09.0833917Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:09.0834352Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:09.0834756Z #define __WINT_MIN__ 0U 2025-05-07T20:25:09.0835170Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:09.0835582Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:09.0835984Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:09.0836423Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:09.0836804Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:09.0837273Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:09.0837688Z #define __INT64_C(c) c ## L 2025-05-07T20:25:09.0838360Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:09.0838910Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:09.0839270Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:09.0839758Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:09.0840507Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:09.0840851Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:09.0841245Z #define __DBL_DIG__ 15 2025-05-07T20:25:09.0841681Z #define __FLT32_DIG__ 6 2025-05-07T20:25:09.0842120Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:09.0842593Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:09.0843034Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:09.0843498Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:09.0843964Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:09.0844402Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:09.0844799Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:09.0845307Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:09.0845908Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:09.0846312Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:09.0846706Z #define __unix__ 1 2025-05-07T20:25:09.0847107Z #define __INT_WIDTH__ 32 2025-05-07T20:25:09.0847505Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:09.0847859Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:09.0848288Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:09.0848800Z #define __UINT16_C(c) c 2025-05-07T20:25:09.0849170Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:09.0849591Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:09.0850104Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:09.0850585Z #define __gnu_linux__ 1 2025-05-07T20:25:09.0851059Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:09.0851450Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:09.0851847Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0852352Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:09.0852710Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:09.0853066Z #define __GNUC__ 11 2025-05-07T20:25:09.0853501Z #define __pie__ 2 2025-05-07T20:25:09.0853873Z #define __MMX__ 1 2025-05-07T20:25:09.0854218Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:09.0854512Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:09.0854826Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:09.0855130Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:09.0855509Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:09.0855949Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0856298Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:09.0856583Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:09.0867442Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:09.0867823Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:09.0868121Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:09.0868427Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:09.0868756Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:09.0869123Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:09.0869454Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:09.0869796Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:09.0870081Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:09.0870380Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:09.0870688Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:09.0870975Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:09.0871263Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:09.0871618Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:09.0872018Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:09.0872315Z #define __SSE2_MATH__ 1 2025-05-07T20:25:09.0872587Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:09.0872927Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0873247Z #define __amd64 1 2025-05-07T20:25:09.0873632Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:09.0874217Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:09.0874557Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:09.0874904Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:09.0875194Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:09.0875637Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:09.0875925Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:09.0876224Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:09.0876511Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:09.0876809Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:09.0877107Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:09.0877418Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:09.0877688Z #define __x86_64 1 2025-05-07T20:25:09.0877949Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:09.0878364Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:09.0878868Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:09.0879381Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:09.0879897Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:09.0880317Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:09.0880597Z #define __LP64__ 1 2025-05-07T20:25:09.0880865Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0881255Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:09.0881669Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:09.0881977Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:09.0882286Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:09.0882595Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:09.0882901Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:09.0883202Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:09.0883485Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:09.0883776Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:09.0884070Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:09.0884437Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:09.0884840Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:09.0885152Z #define __FLT_DIG__ 6 2025-05-07T20:25:09.0885404Z #define __NO_INLINE__ 1 2025-05-07T20:25:09.0885678Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:09.0886051Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:09.0886442Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:09.0886723Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:09.0887019Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:09.0887309Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:09.0887591Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:09.0887879Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:09.0888210Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:09.0888523Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:09.0888821Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:09.0889159Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:09.0889524Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:09.0889823Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:09.0890111Z #define __FLT128_DIG__ 33 2025-05-07T20:25:09.0890371Z #define __INT32_C(c) c 2025-05-07T20:25:09.0890639Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:09.0890954Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:09.0891262Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:09.0891570Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:09.0891927Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:09.0892267Z #define unix 1 2025-05-07T20:25:09.0892519Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:09.0892868Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0893206Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:09.0893547Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:09.0893911Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:09.0894193Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:09.0894476Z #define __ELF__ 1 2025-05-07T20:25:09.0894877Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:09.0895198Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:09.0895496Z #define __FLT_RADIX__ 2 2025-05-07T20:25:09.0895857Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:09.0896341Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:09.0896742Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:09.0897025Z #define __SSE_MATH__ 1 2025-05-07T20:25:09.0897274Z #define __k8 1 2025-05-07T20:25:09.0897603Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:09.0898018Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:09.0898348Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:09.0898674Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:09.0898963Z #define __LDBL_DIG__ 18 2025-05-07T20:25:09.0899236Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:09.0899516Z #define __x86_64__ 1 2025-05-07T20:25:09.0899786Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:09.0900128Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:09.0900495Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0900834Z #define __FLT64_DIG__ 15 2025-05-07T20:25:09.0901153Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0901541Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:09.0901898Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0902197Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:09.0902500Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0902837Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:09.0903243Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:09.0903686Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:09.0904008Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:09.0904382Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:09.0904742Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:09.0905072Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:09.0905383Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:09.0905724Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:09.0906030Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:09.0906302Z #define __SEG_FS 1 2025-05-07T20:25:09.0906558Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:09.0906859Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:09.0907166Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0907484Z #define __SEG_GS 1 2025-05-07T20:25:09.0907830Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:09.0908244Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:09.0908549Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:09.0908867Z #define __INT16_TYPE__ short int 2025-05-07T20:25:09.0909172Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:09.0909503Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:09.0909798Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:09.0910078Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:09.0910371Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:09.0910751Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:09.0911170Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0911495Z #define linux 1 2025-05-07T20:25:09.0911748Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0912054Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:09.0912351Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:09.0912632Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:09.0912924Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:09.0913214Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:09.0913763Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:09.0914228Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:09.0914590Z #define __code_model_small__ 1 2025-05-07T20:25:09.0914901Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:09.0915223Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:09.0915634Z #define __k8__ 1 2025-05-07T20:25:09.0915897Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:09.0916229Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:09.0916559Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:09.0916922Z #define __pic__ 2 2025-05-07T20:25:09.0917211Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0917561Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:09.0917881Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0918263Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:09.0918676Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:09.0919071Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:09.0919379Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:09.0919709Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:09.0920049Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:09.0920329Z #define __linux__ 1 2025-05-07T20:25:09.0920593Z #define __INT64_TYPE__ long int 2025-05-07T20:25:09.0920881Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:09.0921175Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:09.0921478Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:09.0921757Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:09.0922091Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0922468Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:09.0922798Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:09.0923088Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:09.0923415Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:09.0923746Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:09.0924548Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:09.0925031Z #define __SSE__ 1 2025-05-07T20:25:09.0925286Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:09.0925657Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:09.0926038Z #define __amd64__ 1 2025-05-07T20:25:09.0926299Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:09.0926576Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:09.0926878Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:09.0927182Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:09.0927483Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:09.0927790Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:09.0928081Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:09.0928379Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:09.0928675Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:09.0929067Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:09.0929583Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:09.0929970Z #define _LP64 1 2025-05-07T20:25:09.0930210Z #define __UINT8_C(c) c 2025-05-07T20:25:09.0930480Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:09.0930771Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:09.0931071Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:09.0931383Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:09.0931715Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:09.0932113Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:09.0932635Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:09.0933050Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0933373Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0933725Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:09.0934133Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:09.0934534Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:09.0934832Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:09.0935213Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:09.0935615Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:09.0935908Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:09.0936193Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:09.0936739Z #define __FXSR__ 1 2025-05-07T20:25:09.0937083Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:09.0937591Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:09.0938170Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:09.0938510Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:09.0938796Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:09.0939166Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:09.0939556Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:09.0939830Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:09.0940095Z #define __PIC__ 2 2025-05-07T20:25:09.0940368Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:09.0940809Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:09.0941241Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:09.0941615Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:09.0941982Z #define __SSE2__ 1 2025-05-07T20:25:09.0942231Z #define __INT32_TYPE__ int 2025-05-07T20:25:09.0942512Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:09.0942796Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:09.0943179Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:09.0943577Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:09.0943875Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:09.0944176Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:09.0944481Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0944783Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:09.0945061Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:09.0945338Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:09.0945654Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0945987Z #define __PIE__ 2 2025-05-07T20:25:09.0946344Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:09.0946775Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:09.0947164Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:09.0947571Z #define __INT16_C(c) c 2025-05-07T20:25:09.0947823Z #define __STDC__ 1 2025-05-07T20:25:09.0948084Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:09.0948393Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:09.0948681Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:09.0949012Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:09.0949406Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:09.0949778Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:09.0950071Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:09.0950389Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:09.0950688Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:09.0951000Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:09.0951327Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0951635Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:09.0951975Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0952411Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:09.0952827Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:09.0953174Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:09.0953650Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:09.0953934Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:09.0954113Z 2025-05-07T20:25:09.1480067Z 2025-05-07T20:25:09.1480968Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
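Note: the same technique applies to the C++ front end; -x c++ tells the driver to treat stdin as C++ source. One detail worth calling out, assuming the same env: GCC 11 defaults to the gnu++17 dialect, which is why the dump below reports __cplusplus 201703L.

    # Print the default C++ standard macro; expect 201703L (C++17) for GCC 11.
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep '#define __cplusplus'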
2025-05-07T20:25:09.1481664Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:09.1482015Z 2025-05-07T20:25:11.1701312Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:11.1701829Z #define __cpp_attributes 200809L 2025-05-07T20:25:11.1702388Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:11.1702931Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:11.1703367Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:11.1704025Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:11.1704399Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:11.1704776Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:11.1705091Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:11.1705624Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:11.1706124Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:11.1706553Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:11.1706940Z #define __CHAR_BIT__ 8 2025-05-07T20:25:11.1707325Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:11.1707770Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:11.1708243Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:11.1708560Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:11.1708868Z #define __cpp_static_assert 201411L 2025-05-07T20:25:11.1709197Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:11.1709538Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:11.1709868Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:11.1710204Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:11.1710574Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:11.1710938Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:11.1711384Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:11.1711849Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:11.1712199Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:11.1712507Z #define __GCC_IEC_559 2 2025-05-07T20:25:11.1712793Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:11.1713103Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:11.1713404Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:11.1713910Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:11.1714241Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:11.1714591Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:11.1714940Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:11.1715313Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:11.1715676Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:11.1715978Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:11.1716284Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:11.1716595Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:11.1716927Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:11.1717223Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:11.1717519Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:11.1717828Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:11.1718197Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:11.1718563Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:11.1718843Z #define __INT8_C(c) c 2025-05-07T20:25:11.1719150Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:11.1719465Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:11.1719817Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:11.1720181Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:11.1720493Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:11.1720818Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:11.1721171Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:11.1721566Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:11.1721890Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:11.1722194Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:11.1722489Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:11.1722799Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:11.1723105Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:11.1723537Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:11.1724403Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:11.1724749Z #define __linux 1 2025-05-07T20:25:11.1725017Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:11.1725354Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:11.1725685Z #define __unix 1 2025-05-07T20:25:11.1725952Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:11.1726543Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:11.1726873Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:11.1727168Z #define __WINT_MIN__ 0U 2025-05-07T20:25:11.1727446Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:11.1727880Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:11.1728178Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:11.1728478Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:11.1728763Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:11.1729069Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:11.1729399Z #define __INT64_C(c) c ## L 2025-05-07T20:25:11.1729700Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:11.1730024Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:11.1730335Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:11.1730669Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:11.1730970Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:11.1731268Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:11.1731661Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:11.1732073Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:11.1732344Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:11.1732651Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:11.1732966Z #define __DBL_DIG__ 15 2025-05-07T20:25:11.1733214Z #define __FLT32_DIG__ 6 2025-05-07T20:25:11.1733549Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:11.1733923Z #define __GXX_WEAK__ 1 2025-05-07T20:25:11.1734205Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:11.1734482Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:11.1734843Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:11.1735220Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:11.1735515Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:11.1735852Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:11.1736209Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:11.1736659Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:11.1737095Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:11.1737393Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:11.1737688Z #define __unix__ 1 2025-05-07T20:25:11.1737937Z #define __INT_WIDTH__ 32 2025-05-07T20:25:11.1738201Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:11.1738476Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:11.1738765Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:11.1748836Z #define __UINT16_C(c) c 2025-05-07T20:25:11.1749170Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:11.1749472Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:11.1749911Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:11.1750365Z #define __gnu_linux__ 1 2025-05-07T20:25:11.1750642Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:11.1750936Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:11.1751254Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:11.1751580Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:11.1751885Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:11.1752179Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:11.1752452Z #define __GNUC__ 11 2025-05-07T20:25:11.1752698Z #define __GXX_RTTI 1 2025-05-07T20:25:11.1752947Z #define __pie__ 2 2025-05-07T20:25:11.1753177Z #define __MMX__ 1 2025-05-07T20:25:11.1753425Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:11.1753851Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:11.1754157Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:11.1754457Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:11.1754738Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:11.1755069Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:11.1755423Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:11.1755813Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:11.1756230Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:11.1756562Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:11.1757134Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:11.1757434Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:11.1757724Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:11.1758066Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:11.1758495Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:11.1758782Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:11.1759071Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:11.1759388Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:11.1759707Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:11.1760009Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:11.1760317Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:11.1760591Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:11.1760887Z #define __cplusplus 201703L 2025-05-07T20:25:11.1761184Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:11.1761498Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:11.1761775Z #define __DEPRECATED 1 2025-05-07T20:25:11.1762066Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:11.1762393Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:11.1762671Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:11.1763021Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:11.1763419Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:11.1763711Z #define __SSE2_MATH__ 1 2025-05-07T20:25:11.1763987Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:11.1764323Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:11.1764639Z #define __amd64 1 2025-05-07T20:25:11.1764889Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:11.1765188Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:11.1765476Z #define __GNUG__ 11 2025-05-07T20:25:11.1765759Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:11.1766105Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:11.1766388Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:11.1766670Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:11.1766975Z [compiler predefined-macro dump elided: several hundred #define lines confirming GCC 11.4.0 (__VERSION__ "11.4.0", __GNUC_MINOR__ 4, __GXX_ABI_VERSION 1016) targeting x86_64 / LP64 little-endian Linux (__x86_64__ 1, __LP64__ 1, __linux__ 1, __ELF__ 1, _GNU_SOURCE 1) with C++17 defaults (__cpp_deduction_guides 201703L, __cpp_constexpr 201603L)]
2025-05-07T20:25:11.2382762Z + conda run -n build_binary c++ --version
2025-05-07T20:25:13.2380608Z c++
(conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:13.2381022Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:13.2381489Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:13.2382049Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:13.2382394Z 2025-05-07T20:25:13.2382399Z 2025-05-07T20:25:13.3025153Z 2025-05-07T20:25:13.3026168Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:13.3026985Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:13.3027446Z 2025-05-07T20:25:15.2869407Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:15.2871818Z 2025-05-07T20:25:15.2872623Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:15.2873225Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:15.2873829Z 2025-05-07T20:25:17.2913622Z #define __cplusplus 201703L 2025-05-07T20:25:17.2916057Z 2025-05-07T20:25:17.2916786Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:17.2969744Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0 2025-05-07T20:25:17.2970211Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0 2025-05-07T20:25:17.2982923Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:17.2983321Z env: 2025-05-07T20:25:17.2983575Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:17.2983919Z BUILD_ENV: build_binary 2025-05-07T20:25:17.2984202Z BUILD_TARGET: genai 2025-05-07T20:25:17.2984466Z BUILD_VARIANT: cuda 2025-05-07T20:25:17.2984732Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:25:17.2985023Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:17.2985369Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:17.2985740Z ##[endgroup] 2025-05-07T20:25:17.6684008Z ################################################################################ 2025-05-07T20:25:17.6684446Z # Install CUDA 2025-05-07T20:25:17.6684680Z # 2025-05-07T20:25:17.6702401Z # [2025-05-07T20:25:17.669Z] + install_cuda build_binary 12.8.0 2025-05-07T20:25:17.6702835Z ################################################################################ 2025-05-07T20:25:17.6703075Z 2025-05-07T20:25:17.6720703Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:17.7771574Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:17.7771961Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:17.7777607Z + conda clean --packages --tarball -y 2025-05-07T20:25:17.7777837Z 2025-05-07T20:25:18.5240471Z Will remove 32 (142.2 MB) tarball(s). 2025-05-07T20:25:18.5240912Z Will remove 6 (617 KB) package(s). 2025-05-07T20:25:18.5900820Z 2025-05-07T20:25:18.5912265Z + conda clean --all -y 2025-05-07T20:25:18.5912493Z 2025-05-07T20:25:19.3035986Z There are no unused tarball(s) to remove. 2025-05-07T20:25:19.3036585Z Will remove 1 index cache(s). 2025-05-07T20:25:19.3037090Z There are no unused package(s) to remove. 2025-05-07T20:25:19.3037608Z There are no tempfile(s) to remove. 2025-05-07T20:25:19.3038108Z There are no logfile(s) to remove. 2025-05-07T20:25:19.3692366Z 2025-05-07T20:25:19.3707528Z [INSTALL] Installing CUDA 12.8.0 ... 
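[NOTE] The standard-version checks above work by dumping the compiler's predefined macros over an empty translation unit (`-dM -E`) and grepping for the macro that encodes the language standard. A minimal standalone sketch of the same technique, assuming a plain gcc/g++ on PATH rather than the conda-run wrappers used by the workflow (the helper name print_default_standards is illustrative, not part of setup_env.bash):

    #!/usr/bin/env bash
    # Report a toolchain's default C and C++ language standards by dumping
    # its predefined macros over an empty translation unit and grepping.
    print_default_standards() {
      local cc="$1" cxx="$2"
      # __STDC_VERSION__ encodes the C standard; 201710L means C17.
      "$cc" -dM -E - < /dev/null | grep __STDC_VERSION__
      # __cplusplus encodes the C++ standard; 201703L means C++17.
      "$cxx" -dM -E -x c++ - < /dev/null | grep -w __cplusplus
    }

    print_default_standards gcc g++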
2025-05-07T20:25:19.3733362Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0 2025-05-07T20:25:20.3354892Z Channels: 2025-05-07T20:25:20.3355339Z - conda-forge 2025-05-07T20:25:20.3355779Z Platform: linux-64 2025-05-07T20:25:31.2287990Z Collecting package metadata (repodata.json): done 2025-05-07T20:25:32.4177749Z Solving environment: done 2025-05-07T20:25:32.4993604Z 2025-05-07T20:25:32.4994321Z ## Package Plan ## 2025-05-07T20:25:32.4994694Z 2025-05-07T20:25:32.4995036Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:25:32.4995548Z 2025-05-07T20:25:32.4995700Z added / updated specs: 2025-05-07T20:25:32.4996070Z - cuda=12.8.0 2025-05-07T20:25:32.4996282Z 2025-05-07T20:25:32.4996306Z 2025-05-07T20:25:32.4996511Z The following packages will be downloaded: 2025-05-07T20:25:32.4996867Z 2025-05-07T20:25:32.4997070Z package | build 2025-05-07T20:25:32.4997611Z ---------------------------|----------------- 2025-05-07T20:25:32.4998187Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:25:32.4998779Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:25:32.4999234Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:25:32.4999717Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:25:32.5000187Z cuda-12.8.0 | ha804496_0 26 KB conda-forge 2025-05-07T20:25:32.5001446Z cuda-cccl_linux-64-12.8.55 | ha770c72_1 1.0 MB conda-forge 2025-05-07T20:25:32.5002220Z cuda-command-line-tools-12.8.0| ha770c72_0 20 KB conda-forge 2025-05-07T20:25:32.5002782Z cuda-compiler-12.8.0 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:25:32.5003319Z cuda-crt-dev_linux-64-12.8.61| ha770c72_1 90 KB conda-forge 2025-05-07T20:25:32.5003849Z cuda-crt-tools-12.8.61 | ha770c72_1 27 KB conda-forge 2025-05-07T20:25:32.5004360Z cuda-cudart-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:25:32.5004887Z cuda-cudart-dev-12.8.57 | h5888daf_1 23 KB conda-forge 2025-05-07T20:25:32.5005444Z cuda-cudart-dev_linux-64-12.8.57| h3f2d84a_1 377 KB conda-forge 2025-05-07T20:25:32.5006004Z cuda-cudart-static-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:25:32.5006589Z cuda-cudart-static_linux-64-12.8.57| h3f2d84a_1 950 KB conda-forge 2025-05-07T20:25:32.5007175Z cuda-cudart_linux-64-12.8.57| h3f2d84a_1 188 KB conda-forge 2025-05-07T20:25:32.5007717Z cuda-cuobjdump-12.8.55 | hbd13f7d_0 227 KB conda-forge 2025-05-07T20:25:32.5008220Z cuda-cupti-12.8.57 | hbd13f7d_0 1.8 MB conda-forge 2025-05-07T20:25:32.5008721Z cuda-cupti-dev-12.8.57 | h5888daf_0 4.0 MB conda-forge 2025-05-07T20:25:32.5009234Z cuda-cuxxfilt-12.8.55 | hbd13f7d_0 211 KB conda-forge 2025-05-07T20:25:32.5009747Z cuda-driver-dev-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:25:32.5010298Z cuda-driver-dev_linux-64-12.8.90| h3f2d84a_1 36 KB conda-forge 2025-05-07T20:25:32.5010824Z cuda-gdb-12.8.55 | h50b4baa_0 353 KB conda-forge 2025-05-07T20:25:32.5011318Z cuda-libraries-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:32.5011850Z cuda-libraries-dev-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:32.5012376Z cuda-nsight-12.8.55 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:25:32.5012863Z cuda-nvcc-12.8.61 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:25:32.5013370Z cuda-nvcc-dev_linux-64-12.8.61| he91c749_1 12.7 MB conda-forge 2025-05-07T20:25:32.5013904Z cuda-nvcc-impl-12.8.61 | h85509e4_1 25 KB conda-forge 2025-05-07T20:25:32.5014421Z cuda-nvcc-tools-12.8.61 | he02047a_1 24.5
MB conda-forge 2025-05-07T20:25:32.5014943Z cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge 2025-05-07T20:25:32.5015457Z cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge 2025-05-07T20:25:32.5015972Z cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge 2025-05-07T20:25:32.5016474Z cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge 2025-05-07T20:25:32.5016980Z cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge 2025-05-07T20:25:32.5017488Z cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge 2025-05-07T20:25:32.5017987Z cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge 2025-05-07T20:25:32.5018481Z cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:25:32.5018990Z cuda-nvvm-dev_linux-64-12.8.61| ha770c72_1 25 KB conda-forge 2025-05-07T20:25:32.5019520Z cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge 2025-05-07T20:25:32.5020038Z cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge 2025-05-07T20:25:32.5020535Z cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge 2025-05-07T20:25:32.5021016Z cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:25:32.5021634Z cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge 2025-05-07T20:25:32.5022285Z cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:25:32.5022807Z cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:25:32.5023336Z cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge 2025-05-07T20:25:32.5024289Z cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:25:32.5024786Z cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:32.5025269Z cuda-version-12.8 | h5d125a7_3 21 KB conda-forge 2025-05-07T20:25:32.5025836Z cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:32.5026354Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:25:32.5026814Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:25:32.5027264Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:25:32.5027795Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:25:32.5028377Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:25:32.5028949Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:25:32.5029502Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:25:32.5030005Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:25:32.5030519Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:25:32.5031052Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:25:32.5031547Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:25:32.5031997Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:32.5032451Z gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge 2025-05-07T20:25:32.5032926Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:25:32.5033350Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:32.5033885Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:25:32.5034340Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:25:32.5034787Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:25:32.5035253Z libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge 2025-05-07T20:25:32.5035810Z libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge 2025-05-07T20:25:32.5036313Z libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge 2025-05-07T20:25:32.5036819Z libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge 2025-05-07T20:25:32.5037318Z libcufile-1.13.0.11 | h12f29b5_0 939 KB 
conda-forge 2025-05-07T20:25:32.5037826Z libcufile-dev-1.13.0.11 | h5888daf_0 35 KB conda-forge 2025-05-07T20:25:32.5038334Z libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge 2025-05-07T20:25:32.5038833Z libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge 2025-05-07T20:25:32.5039350Z libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge 2025-05-07T20:25:32.5039878Z libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge 2025-05-07T20:25:32.5040403Z libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge 2025-05-07T20:25:32.5040925Z libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge 2025-05-07T20:25:32.5041450Z libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge 2025-05-07T20:25:32.5042099Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:25:32.5042705Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:25:32.5043212Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:25:32.5043721Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:25:32.5044210Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:25:32.5044670Z libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge 2025-05-07T20:25:32.5045155Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:25:32.5045639Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:25:32.5046094Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:25:32.5046554Z libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge 2025-05-07T20:25:32.5047042Z libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge 2025-05-07T20:25:32.5047515Z libnsl-2.0.1 | hd590300_0 33 KB conda-forge 2025-05-07T20:25:32.5047965Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:25:32.5048450Z libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge 2025-05-07T20:25:32.5048969Z libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge 2025-05-07T20:25:32.5049493Z libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge 2025-05-07T20:25:32.5050014Z libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge 2025-05-07T20:25:32.5050532Z libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge 2025-05-07T20:25:32.5051037Z libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge 2025-05-07T20:25:32.5051538Z libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge 2025-05-07T20:25:32.5052019Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:25:32.5052487Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:25:32.5052975Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:25:32.5053456Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:25:32.5053926Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:25:32.5054382Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:25:32.5054855Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:25:32.5055355Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:25:32.5055831Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:25:32.5056296Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:25:32.5056742Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:25:32.5057241Z nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge 2025-05-07T20:25:32.5057738Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:25:32.5058170Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:25:32.5058607Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:25:32.5059110Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:25:32.5059607Z pcre2-10.44 | hc749103_2 
934 KB conda-forge 2025-05-07T20:25:32.5060083Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:25:32.5060695Z python-3.10.13 |hd12c33a_1_cpython 24.5 MB conda-forge 2025-05-07T20:25:32.5061303Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:25:32.5061770Z sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge 2025-05-07T20:25:32.5062213Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:25:32.5062879Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:25:32.5063342Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:25:32.5063828Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:25:32.5064345Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:25:32.5064863Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:25:32.5065405Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:25:32.5065923Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:25:32.5066444Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:25:32.5066957Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:25:32.5067435Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:25:32.5067920Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:25:32.5068409Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:25:32.5068934Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:25:32.5069473Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:25:32.5069992Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:32.5070501Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:25:32.5071018Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:32.5071511Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:25:32.5072010Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:25:32.5072537Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:25:32.5073045Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:25:32.5073594Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:25:32.5074030Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:25:32.5074456Z ------------------------------------------------------------ 2025-05-07T20:25:32.5074838Z Total: 1.90 GB 2025-05-07T20:25:32.5075087Z 2025-05-07T20:25:32.5075235Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:32.5075484Z 2025-05-07T20:25:32.5075722Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:25:32.5076197Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:25:32.5076670Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:25:32.5077195Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:25:32.5077685Z cuda conda-forge/noarch::cuda-12.8.0-ha804496_0 2025-05-07T20:25:32.5078212Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.8.55-ha770c72_1 2025-05-07T20:25:32.5078892Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.8.0-ha770c72_0 2025-05-07T20:25:32.5079544Z cuda-compiler conda-forge/noarch::cuda-compiler-12.8.0-hbad6d8a_0 2025-05-07T20:25:32.5080161Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:25:32.5080888Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.8.61-ha770c72_1 2025-05-07T20:25:32.5081563Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.8.57-h5888daf_1 
2025-05-07T20:25:32.5082153Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.8.57-h5888daf_1 2025-05-07T20:25:32.5082802Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:25:32.5083633Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.8.57-h5888daf_1 2025-05-07T20:25:32.5084334Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:25:32.5085017Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:25:32.5085649Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5086231Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.8.57-hbd13f7d_0 2025-05-07T20:25:32.5086812Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.8.57-h5888daf_0 2025-05-07T20:25:32.5087417Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5088029Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.8.57-h5888daf_1 2025-05-07T20:25:32.5088671Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.8.90-h3f2d84a_1 2025-05-07T20:25:32.5089272Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.8.55-h50b4baa_0 2025-05-07T20:25:32.5089827Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.8.0-ha770c72_0 2025-05-07T20:25:32.5090468Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.8.0-ha770c72_0 2025-05-07T20:25:32.5091079Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.8.55-h7938cbb_0 2025-05-07T20:25:32.5091622Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.8.61-hcdd1206_0 2025-05-07T20:25:32.5092210Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.8.61-he91c749_1 2025-05-07T20:25:32.5092855Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.8.61-h85509e4_1 2025-05-07T20:25:32.5093462Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.8.61-he02047a_1 2025-05-07T20:25:32.5094086Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.8.61-h04802cd_0 2025-05-07T20:25:32.5094700Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5095286Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5095858Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.8.57-hbd13f7d_0 2025-05-07T20:25:32.5096428Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5096989Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.8.61-hbd13f7d_0 2025-05-07T20:25:32.5097551Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.8.61-h5888daf_0 2025-05-07T20:25:32.5098121Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5098712Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:25:32.5099344Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.8.61-he02047a_1 2025-05-07T20:25:32.5099950Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.8.61-he02047a_1 2025-05-07T20:25:32.5100526Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.8.57-hbd13f7d_0 2025-05-07T20:25:32.5101068Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5101656Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.8.55-h5888daf_0 2025-05-07T20:25:32.5102298Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.8.55-h7938cbb_0 2025-05-07T20:25:32.5102914Z cuda-runtime conda-forge/noarch::cuda-runtime-12.8.0-ha804496_0 2025-05-07T20:25:32.5103535Z 
cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5104258Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.8.0-ha804496_0 2025-05-07T20:25:32.5104888Z cuda-tools conda-forge/linux-64::cuda-tools-12.8.0-ha770c72_0 2025-05-07T20:25:32.5105431Z cuda-version conda-forge/noarch::cuda-version-12.8-h5d125a7_3 2025-05-07T20:25:32.5106028Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.8.0-ha770c72_0 2025-05-07T20:25:32.5106645Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:25:32.5107147Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:25:32.5107602Z expat conda-forge/linux-64::expat-2.7.0-h5888daf_0 2025-05-07T20:25:32.5108186Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:25:32.5108864Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:25:32.5109543Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:25:32.5110196Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:25:32.5110771Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:25:32.5111331Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:25:32.5111891Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:25:32.5112418Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:25:32.5112898Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:25:32.5113374Z gds-tools conda-forge/linux-64::gds-tools-1.13.0.11-h5888daf_0 2025-05-07T20:25:32.5113975Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:25:32.5114408Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:25:32.5114881Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:25:32.5115356Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:25:32.5115818Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:25:32.5116329Z libcublas conda-forge/linux-64::libcublas-12.8.3.14-h9ab20c4_0 2025-05-07T20:25:32.5116905Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.8.3.14-h9ab20c4_0 2025-05-07T20:25:32.5117482Z libcufft conda-forge/linux-64::libcufft-11.3.3.41-hbd13f7d_0 2025-05-07T20:25:32.5118046Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.3.41-h5888daf_0 2025-05-07T20:25:32.5118613Z libcufile conda-forge/linux-64::libcufile-1.13.0.11-h12f29b5_0 2025-05-07T20:25:32.5119177Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.13.0.11-h5888daf_0 2025-05-07T20:25:32.5119754Z libcurand conda-forge/linux-64::libcurand-10.3.9.55-hbd13f7d_0 2025-05-07T20:25:32.5120325Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.9.55-h5888daf_0 2025-05-07T20:25:32.5120920Z libcusolver conda-forge/linux-64::libcusolver-11.7.2.55-h9ab20c4_0 2025-05-07T20:25:32.5121531Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.2.55-h9ab20c4_0 2025-05-07T20:25:32.5122145Z libcusparse conda-forge/linux-64::libcusparse-12.5.7.53-hbd13f7d_0 2025-05-07T20:25:32.5122758Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.7.53-h5888daf_0 2025-05-07T20:25:32.5123347Z libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2 2025-05-07T20:25:32.5124259Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:25:32.5124868Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:25:32.5125440Z libfreetype6 
conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:25:32.5126024Z libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:25:32.5126765Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:25:32.5127266Z libglvnd conda-forge/linux-64::libglvnd-1.7.0-ha4b6fd6_2 2025-05-07T20:25:32.5127933Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:25:32.5128469Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:25:32.5128951Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:25:32.5129432Z libnpp conda-forge/linux-64::libnpp-12.3.3.65-hbd13f7d_0 2025-05-07T20:25:32.5129957Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.3.65-h5888daf_0 2025-05-07T20:25:32.5130464Z libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0 2025-05-07T20:25:32.5130948Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:25:32.5131477Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5132077Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.8.55-h5888daf_0 2025-05-07T20:25:32.5132686Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.8.61-hbd13f7d_0 2025-05-07T20:25:32.5133308Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.8.61-h5888daf_0 2025-05-07T20:25:32.5133906Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.5.57-h97fd463_0 2025-05-07T20:25:32.5134484Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.5.57-ha770c72_0 2025-05-07T20:25:32.5135060Z libopengl conda-forge/linux-64::libopengl-1.7.0-ha4b6fd6_2 2025-05-07T20:25:32.5135562Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:25:32.5136066Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:25:32.5136605Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:25:32.5137127Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:25:32.5137619Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:25:32.5138149Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:25:32.5138711Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:25:32.5139219Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:25:32.5139708Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:32.5140180Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:25:32.5140727Z nsight-compute conda-forge/linux-64::nsight-compute-2025.1.0.14-hb5ebaad_0 2025-05-07T20:25:32.5152825Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:25:32.5153284Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:25:32.5153817Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:25:32.5154391Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:25:32.5154955Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:25:32.5155518Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:25:32.5156106Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:25:32.5156600Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:25:32.5157091Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:25:32.5157647Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:25:32.5158252Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 
2025-05-07T20:25:32.5158852Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:25:32.5159505Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:25:32.5160111Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0 2025-05-07T20:25:32.5160829Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0 2025-05-07T20:25:32.5161517Z xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0 2025-05-07T20:25:32.5162070Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0 2025-05-07T20:25:32.5162612Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0 2025-05-07T20:25:32.5163165Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0 2025-05-07T20:25:32.5163774Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2 2025-05-07T20:25:32.5164425Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0 2025-05-07T20:25:32.5165032Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0 2025-05-07T20:25:32.5165607Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0 2025-05-07T20:25:32.5166182Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0 2025-05-07T20:25:32.5166762Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0 2025-05-07T20:25:32.5167333Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0 2025-05-07T20:25:32.5167949Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0 2025-05-07T20:25:32.5168540Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3 2025-05-07T20:25:32.5169042Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2 2025-05-07T20:25:32.5169459Z The following packages will be UPDATED: 2025-05-07T20:25:32.5170005Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0 2025-05-07T20:25:32.5170674Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:32.5171301Z The following packages will be SUPERSEDED by a higher-priority channel: 2025-05-07T20:25:32.5171980Z python pkgs/main::python-3.10.16-he870216_1 --> conda-forge::python-3.10.13-hd12c33a_1_cpython 2025-05-07T20:25:32.5172687Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1 2025-05-07T20:25:32.5173326Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101 2025-05-07T20:25:32.5173887Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:32.5174316Z [conda progress-bar frames elided: parallel downloads of libcublas-12.8.3.14 (460.2 MB), nsight-compute-2025.1.0.14 (320.6 MB), libcusparse-12.5.7.53 (164.9 MB), libcusolver-11.7.2.55 (156.9 MB), libcufft-11.3.3.41 (147.4 MB), libnpp-12.3.3.65 (130.6 MB), cuda-nsight-12.8.55 (113.2 MB), cuda-nvvp-12.8.57 (112.4 MB), cuda-nvrtc-12.8.61 (63.1 MB), and smaller packages advance from 0%; in the frames captured here libcufft reaches ~61%, libcusparse ~56%, libcusolver ~47%, nsight-compute ~24%, and libcublas ~13%]
| 320.6 MB | ##5 | 25%  2025-05-07T20:25:35.1724377Z 2025-05-07T20:25:35.1724382Z 2025-05-07T20:25:35.1726506Z 2025-05-07T20:25:35.2036553Z libcusolver-11.7.2.5 | 156.9 MB | ####9 | 49%  2025-05-07T20:25:35.2036844Z 2025-05-07T20:25:35.2036848Z 2025-05-07T20:25:35.2072703Z libcusparse-12.5.7.5 | 164.9 MB | #####8 | 58%  2025-05-07T20:25:35.2072988Z 2025-05-07T20:25:35.2072991Z 2025-05-07T20:25:35.2072995Z 2025-05-07T20:25:35.2072999Z 2025-05-07T20:25:35.2663253Z libcufft-11.3.3.41 | 147.4 MB | ######3 | 63%  2025-05-07T20:25:35.2666942Z 2025-05-07T20:25:35.2728018Z nsight-compute-2025. | 320.6 MB | ##6 | 26%  2025-05-07T20:25:35.2728557Z 2025-05-07T20:25:35.2728561Z 2025-05-07T20:25:35.2728565Z 2025-05-07T20:25:35.3036517Z libcusolver-11.7.2.5 | 156.9 MB | #####1 | 51%  2025-05-07T20:25:35.3036826Z 2025-05-07T20:25:35.3036831Z 2025-05-07T20:25:35.3074350Z libcusparse-12.5.7.5 | 164.9 MB | ###### | 61%  2025-05-07T20:25:35.3074779Z 2025-05-07T20:25:35.3074783Z 2025-05-07T20:25:35.3074787Z 2025-05-07T20:25:35.3074791Z 2025-05-07T20:25:35.3241298Z libcufft-11.3.3.41 | 147.4 MB | ######5 | 66%  2025-05-07T20:25:35.3663559Z libcublas-12.8.3.14 | 460.2 MB | #3 | 14% 2025-05-07T20:25:35.3664900Z 2025-05-07T20:25:35.3768894Z nsight-compute-2025. | 320.6 MB | ##7 | 28%  2025-05-07T20:25:35.3769179Z 2025-05-07T20:25:35.3769191Z 2025-05-07T20:25:35.3769195Z 2025-05-07T20:25:35.4088259Z libcusolver-11.7.2.5 | 156.9 MB | #####3 | 53%  2025-05-07T20:25:35.4088567Z 2025-05-07T20:25:35.4088571Z 2025-05-07T20:25:35.4110575Z libcusparse-12.5.7.5 | 164.9 MB | ######3 | 63%  2025-05-07T20:25:35.4110984Z 2025-05-07T20:25:35.4110990Z 2025-05-07T20:25:35.4110995Z 2025-05-07T20:25:35.4113322Z 2025-05-07T20:25:35.4242992Z libcufft-11.3.3.41 | 147.4 MB | ######8 | 68%  2025-05-07T20:25:35.4666301Z libcublas-12.8.3.14 | 460.2 MB | #4 | 15% 2025-05-07T20:25:35.4666703Z 2025-05-07T20:25:35.4772172Z nsight-compute-2025. | 320.6 MB | ##8 | 29%  2025-05-07T20:25:35.4772464Z 2025-05-07T20:25:35.4772471Z 2025-05-07T20:25:35.5089484Z 2025-05-07T20:25:35.5090022Z libcusolver-11.7.2.5 | 156.9 MB | #####5 | 56%  2025-05-07T20:25:35.5090423Z 2025-05-07T20:25:35.5091808Z 2025-05-07T20:25:35.5112703Z libcusparse-12.5.7.5 | 164.9 MB | ######5 | 66%  2025-05-07T20:25:35.5113139Z 2025-05-07T20:25:35.5113146Z 2025-05-07T20:25:35.5113152Z 2025-05-07T20:25:35.5113739Z 2025-05-07T20:25:35.5705090Z libcufft-11.3.3.41 | 147.4 MB | #######1 | 71%  2025-05-07T20:25:35.5706096Z 2025-05-07T20:25:35.5772166Z nsight-compute-2025. | 320.6 MB | ### | 30%  2025-05-07T20:25:35.5772572Z 2025-05-07T20:25:35.5772578Z 2025-05-07T20:25:35.5772584Z 2025-05-07T20:25:35.5821413Z libcusolver-11.7.2.5 | 156.9 MB | #####7 | 58%  2025-05-07T20:25:35.6094471Z libcublas-12.8.3.14 | 460.2 MB | #5 | 15% 2025-05-07T20:25:35.6094797Z 2025-05-07T20:25:35.6097011Z 2025-05-07T20:25:35.6114384Z libcusparse-12.5.7.5 | 164.9 MB | ######8 | 68%  2025-05-07T20:25:35.6114761Z 2025-05-07T20:25:35.6114766Z 2025-05-07T20:25:35.6114770Z 2025-05-07T20:25:35.6118237Z 2025-05-07T20:25:35.6773903Z libcufft-11.3.3.41 | 147.4 MB | #######3 | 74%  2025-05-07T20:25:35.6774203Z 2025-05-07T20:25:35.6774207Z 2025-05-07T20:25:35.6775198Z 2025-05-07T20:25:35.6822282Z libcusolver-11.7.2.5 | 156.9 MB | ###### | 60%  2025-05-07T20:25:35.6992027Z libcublas-12.8.3.14 | 460.2 MB | #6 | 16% 2025-05-07T20:25:35.6992774Z 2025-05-07T20:25:35.7258724Z nsight-compute-2025. 
| 320.6 MB | ###1 | 31%  2025-05-07T20:25:35.7259022Z 2025-05-07T20:25:35.7259026Z 2025-05-07T20:25:35.7290206Z libcusparse-12.5.7.5 | 164.9 MB | ####### | 71%  2025-05-07T20:25:35.7290500Z 2025-05-07T20:25:35.7290504Z 2025-05-07T20:25:35.7290508Z 2025-05-07T20:25:35.7291443Z 2025-05-07T20:25:35.7774565Z libcufft-11.3.3.41 | 147.4 MB | #######6 | 76%  2025-05-07T20:25:35.7774862Z 2025-05-07T20:25:35.7774866Z 2025-05-07T20:25:35.7775388Z 2025-05-07T20:25:35.7824933Z libcusolver-11.7.2.5 | 156.9 MB | ######2 | 62%  2025-05-07T20:25:35.7993080Z libcublas-12.8.3.14 | 460.2 MB | #6 | 17% 2025-05-07T20:25:35.7995809Z 2025-05-07T20:25:35.8323018Z nsight-compute-2025. | 320.6 MB | ###2 | 32%  2025-05-07T20:25:35.8323313Z 2025-05-07T20:25:35.8323317Z 2025-05-07T20:25:35.8349972Z libcusparse-12.5.7.5 | 164.9 MB | #######2 | 73%  2025-05-07T20:25:35.8350370Z 2025-05-07T20:25:35.8350402Z 2025-05-07T20:25:35.8350406Z 2025-05-07T20:25:35.8352859Z 2025-05-07T20:25:35.8774868Z libcufft-11.3.3.41 | 147.4 MB | #######9 | 79%  2025-05-07T20:25:35.8775277Z 2025-05-07T20:25:35.8775281Z 2025-05-07T20:25:35.8776014Z 2025-05-07T20:25:35.8994755Z libcusolver-11.7.2.5 | 156.9 MB | ######5 | 65%  2025-05-07T20:25:35.8995463Z 2025-05-07T20:25:35.9325366Z nsight-compute-2025. | 320.6 MB | ###3 | 34%  2025-05-07T20:25:35.9325657Z 2025-05-07T20:25:35.9325776Z 2025-05-07T20:25:35.9352614Z libcusparse-12.5.7.5 | 164.9 MB | #######5 | 75%  2025-05-07T20:25:35.9353041Z 2025-05-07T20:25:35.9353047Z 2025-05-07T20:25:35.9353052Z 2025-05-07T20:25:35.9354624Z 2025-05-07T20:25:35.9778565Z libcufft-11.3.3.41 | 147.4 MB | ########1 | 82%  2025-05-07T20:25:35.9779424Z 2025-05-07T20:25:35.9779431Z 2025-05-07T20:25:35.9782043Z 2025-05-07T20:25:35.9998573Z libcusolver-11.7.2.5 | 156.9 MB | ######8 | 68%  2025-05-07T20:25:35.9999676Z 2025-05-07T20:25:36.0223960Z nsight-compute-2025. | 320.6 MB | ###5 | 35%  2025-05-07T20:25:36.0327952Z libcublas-12.8.3.14 | 460.2 MB | #7 | 18% 2025-05-07T20:25:36.0328242Z 2025-05-07T20:25:36.0328246Z 2025-05-07T20:25:36.0357375Z libcusparse-12.5.7.5 | 164.9 MB | #######7 | 78%  2025-05-07T20:25:36.0357921Z 2025-05-07T20:25:36.0358040Z 2025-05-07T20:25:36.0358048Z 2025-05-07T20:25:36.0358148Z 2025-05-07T20:25:36.0896874Z libcufft-11.3.3.41 | 147.4 MB | ########4 | 85%  2025-05-07T20:25:36.0897176Z 2025-05-07T20:25:36.0897180Z 2025-05-07T20:25:36.0897725Z 2025-05-07T20:25:36.1071933Z libcusolver-11.7.2.5 | 156.9 MB | ####### | 71%  2025-05-07T20:25:36.1072227Z 2025-05-07T20:25:36.1417043Z nsight-compute-2025. | 320.6 MB | ###6 | 36%  2025-05-07T20:25:36.1417332Z 2025-05-07T20:25:36.1417337Z 2025-05-07T20:25:36.1417341Z 2025-05-07T20:25:36.1417809Z 2025-05-07T20:25:36.1433621Z libcufft-11.3.3.41 | 147.4 MB | ########7 | 87%  2025-05-07T20:25:36.1433954Z 2025-05-07T20:25:36.1433958Z 2025-05-07T20:25:36.1575144Z libcusparse-12.5.7.5 | 164.9 MB | ######## | 80%  2025-05-07T20:25:36.2073751Z libcublas-12.8.3.14 | 460.2 MB | #8 | 18% 2025-05-07T20:25:36.2075437Z 2025-05-07T20:25:36.2244206Z nsight-compute-2025. 
| 320.6 MB | ###7 | 38%  2025-05-07T20:25:36.2244522Z 2025-05-07T20:25:36.2244528Z 2025-05-07T20:25:36.2247304Z 2025-05-07T20:25:36.2419133Z libcusolver-11.7.2.5 | 156.9 MB | #######3 | 73%  2025-05-07T20:25:36.2419434Z 2025-05-07T20:25:36.2419439Z 2025-05-07T20:25:36.2419443Z 2025-05-07T20:25:36.2419958Z 2025-05-07T20:25:36.2434115Z libcufft-11.3.3.41 | 147.4 MB | ########9 | 90%  2025-05-07T20:25:36.2434423Z 2025-05-07T20:25:36.2434427Z 2025-05-07T20:25:36.2577958Z libcusparse-12.5.7.5 | 164.9 MB | ########2 | 83%  2025-05-07T20:25:36.3167358Z libcublas-12.8.3.14 | 460.2 MB | #8 | 19% 2025-05-07T20:25:36.3167950Z 2025-05-07T20:25:36.3244149Z nsight-compute-2025. | 320.6 MB | ###8 | 39%  2025-05-07T20:25:36.3244445Z 2025-05-07T20:25:36.3244450Z 2025-05-07T20:25:36.3244747Z 2025-05-07T20:25:36.3420394Z libcusolver-11.7.2.5 | 156.9 MB | #######5 | 76%  2025-05-07T20:25:36.3420693Z 2025-05-07T20:25:36.3420697Z 2025-05-07T20:25:36.3420701Z 2025-05-07T20:25:36.3421232Z 2025-05-07T20:25:36.3498135Z libcufft-11.3.3.41 | 147.4 MB | #########2 | 92%  2025-05-07T20:25:36.3498440Z 2025-05-07T20:25:36.3498903Z 2025-05-07T20:25:36.4026960Z libcusparse-12.5.7.5 | 164.9 MB | ########4 | 85%  2025-05-07T20:25:36.4249134Z libcublas-12.8.3.14 | 460.2 MB | #9 | 20% 2025-05-07T20:25:36.4249420Z 2025-05-07T20:25:36.4249424Z 2025-05-07T20:25:36.4252820Z 2025-05-07T20:25:36.4259004Z libcusolver-11.7.2.5 | 156.9 MB | #######7 | 78%  2025-05-07T20:25:36.4259613Z 2025-05-07T20:25:36.4481404Z nsight-compute-2025. | 320.6 MB | ###9 | 40%  2025-05-07T20:25:36.4481713Z 2025-05-07T20:25:36.4481717Z 2025-05-07T20:25:36.4481721Z 2025-05-07T20:25:36.4483863Z 2025-05-07T20:25:36.4528365Z libcufft-11.3.3.41 | 147.4 MB | #########5 | 95%  2025-05-07T20:25:36.4528657Z 2025-05-07T20:25:36.4528661Z 2025-05-07T20:25:36.5028299Z libcusparse-12.5.7.5 | 164.9 MB | ########7 | 87%  2025-05-07T20:25:36.5252650Z libcublas-12.8.3.14 | 460.2 MB | ## | 20% 2025-05-07T20:25:36.5252933Z 2025-05-07T20:25:36.5252937Z 2025-05-07T20:25:36.5256346Z 2025-05-07T20:25:36.5365787Z libcusolver-11.7.2.5 | 156.9 MB | ######## | 80%  2025-05-07T20:25:36.5367748Z 2025-05-07T20:25:36.5483238Z nsight-compute-2025. | 320.6 MB | ####1 | 41%  2025-05-07T20:25:36.5483545Z 2025-05-07T20:25:36.5483549Z 2025-05-07T20:25:36.5483553Z 2025-05-07T20:25:36.5484090Z 2025-05-07T20:25:36.5535627Z libcufft-11.3.3.41 | 147.4 MB | #########7 | 98%  2025-05-07T20:25:36.5536032Z 2025-05-07T20:25:36.5536072Z 2025-05-07T20:25:36.6028688Z libcusparse-12.5.7.5 | 164.9 MB | ########9 | 89%  2025-05-07T20:25:36.6257346Z libcublas-12.8.3.14 | 460.2 MB | ##1 | 21% 2025-05-07T20:25:36.6257875Z 2025-05-07T20:25:36.6257883Z 2025-05-07T20:25:36.6257888Z 2025-05-07T20:25:36.6420761Z libcusolver-11.7.2.5 | 156.9 MB | ########2 | 83%  2025-05-07T20:25:36.6423103Z 2025-05-07T20:25:36.6641396Z nsight-compute-2025. | 320.6 MB | ####2 | 42%  2025-05-07T20:25:36.6641732Z 2025-05-07T20:25:36.6641736Z 2025-05-07T20:25:36.7028665Z libcusparse-12.5.7.5 | 164.9 MB | #########1 | 92%  2025-05-07T20:25:36.7270546Z libcublas-12.8.3.14 | 460.2 MB | ##2 | 22% 2025-05-07T20:25:36.7270927Z 2025-05-07T20:25:36.7270933Z 2025-05-07T20:25:36.7270939Z 2025-05-07T20:25:36.7641606Z libcusolver-11.7.2.5 | 156.9 MB | ########5 | 85%  2025-05-07T20:25:36.7642010Z 2025-05-07T20:25:36.7642016Z 2025-05-07T20:25:36.7851432Z libcusparse-12.5.7.5 | 164.9 MB | #########4 | 94%  2025-05-07T20:25:36.7852260Z 2025-05-07T20:25:36.8271511Z nsight-compute-2025. 
| 320.6 MB | ####3 | 43%  2025-05-07T20:25:36.8271914Z 2025-05-07T20:25:36.8271944Z 2025-05-07T20:25:36.8272640Z 2025-05-07T20:25:36.8503745Z libcusolver-11.7.2.5 | 156.9 MB | ########8 | 88%  2025-05-07T20:25:36.8649010Z libcublas-12.8.3.14 | 460.2 MB | ##2 | 23% 2025-05-07T20:25:36.8649398Z 2025-05-07T20:25:36.8649404Z 2025-05-07T20:25:36.9040440Z libcusparse-12.5.7.5 | 164.9 MB | #########7 | 97%  2025-05-07T20:25:36.9040743Z 2025-05-07T20:25:36.9278807Z nsight-compute-2025. | 320.6 MB | ####4 | 44%  2025-05-07T20:25:36.9279108Z 2025-05-07T20:25:36.9279112Z 2025-05-07T20:25:36.9279537Z 2025-05-07T20:25:36.9504732Z libcusolver-11.7.2.5 | 156.9 MB | #########1 | 91%  2025-05-07T20:25:36.9784942Z libcublas-12.8.3.14 | 460.2 MB | ##3 | 24% 2025-05-07T20:25:36.9785329Z 2025-05-07T20:25:36.9785334Z 2025-05-07T20:25:37.0110058Z libcusparse-12.5.7.5 | 164.9 MB | #########9 | 100%  2025-05-07T20:25:37.0112371Z 2025-05-07T20:25:37.0301714Z nsight-compute-2025. | 320.6 MB | ####5 | 45%  2025-05-07T20:25:37.0302201Z 2025-05-07T20:25:37.0302206Z 2025-05-07T20:25:37.0302959Z 2025-05-07T20:25:37.0506623Z libcusolver-11.7.2.5 | 156.9 MB | #########4 | 94%  2025-05-07T20:25:37.1113351Z libcublas-12.8.3.14 | 460.2 MB | ##4 | 25% 2025-05-07T20:25:37.1114364Z 2025-05-07T20:25:37.1365872Z nsight-compute-2025. | 320.6 MB | ####6 | 46%  2025-05-07T20:25:37.1366222Z 2025-05-07T20:25:37.1366227Z 2025-05-07T20:25:37.1366231Z 2025-05-07T20:25:37.1508736Z libcusolver-11.7.2.5 | 156.9 MB | #########7 | 97%  2025-05-07T20:25:37.2113208Z libcublas-12.8.3.14 | 460.2 MB | ##5 | 25% 2025-05-07T20:25:37.2115094Z 2025-05-07T20:25:37.2484877Z nsight-compute-2025. | 320.6 MB | ####7 | 48%  2025-05-07T20:25:37.2485174Z 2025-05-07T20:25:37.2485179Z 2025-05-07T20:25:37.2489894Z 2025-05-07T20:25:37.2509211Z libcusolver-11.7.2.5 | 156.9 MB | #########9 | 100%  2025-05-07T20:25:37.3117172Z libcublas-12.8.3.14 | 460.2 MB | ##6 | 26% 2025-05-07T20:25:37.3117507Z 2025-05-07T20:25:37.3512293Z nsight-compute-2025. | 320.6 MB | ####8 | 49%  2025-05-07T20:25:37.4216502Z libcublas-12.8.3.14 | 460.2 MB | ##7 | 27% 2025-05-07T20:25:37.4218339Z 2025-05-07T20:25:37.4514256Z nsight-compute-2025. | 320.6 MB | ##### | 50%  2025-05-07T20:25:37.5216567Z libcublas-12.8.3.14 | 460.2 MB | ##8 | 28% 2025-05-07T20:25:37.5219264Z 2025-05-07T20:25:37.5514843Z nsight-compute-2025. | 320.6 MB | #####1 | 51%  2025-05-07T20:25:37.6217037Z libcublas-12.8.3.14 | 460.2 MB | ##9 | 29% 2025-05-07T20:25:37.6219435Z 2025-05-07T20:25:37.6519606Z nsight-compute-2025. | 320.6 MB | #####2 | 53%  2025-05-07T20:25:37.7219052Z libcublas-12.8.3.14 | 460.2 MB | ### | 30% 2025-05-07T20:25:37.7222181Z 2025-05-07T20:25:37.7522319Z nsight-compute-2025. | 320.6 MB | #####4 | 54%  2025-05-07T20:25:37.8222367Z libcublas-12.8.3.14 | 460.2 MB | ###1 | 31% 2025-05-07T20:25:37.8222637Z 2025-05-07T20:25:37.8522832Z nsight-compute-2025. | 320.6 MB | #####5 | 56%  2025-05-07T20:25:37.9225420Z libcublas-12.8.3.14 | 460.2 MB | ###2 | 32% 2025-05-07T20:25:37.9226351Z 2025-05-07T20:25:37.9810677Z nsight-compute-2025. | 320.6 MB | #####7 | 57%  2025-05-07T20:25:38.0226780Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 33% 2025-05-07T20:25:38.0228950Z 2025-05-07T20:25:38.0814260Z nsight-compute-2025. | 320.6 MB | #####9 | 59%  2025-05-07T20:25:38.1228304Z libcublas-12.8.3.14 | 460.2 MB | ###4 | 34% 2025-05-07T20:25:38.1233187Z 2025-05-07T20:25:38.1983813Z nsight-compute-2025. 
| 320.6 MB | ###### | 61%  2025-05-07T20:25:38.2266778Z libcublas-12.8.3.14 | 460.2 MB | ###5 | 35% 2025-05-07T20:25:38.2269400Z 2025-05-07T20:25:38.3268428Z nsight-compute-2025. | 320.6 MB | ######2 | 63%  2025-05-07T20:25:38.3268928Z 2025-05-07T20:25:38.3611723Z nsight-compute-2025. | 320.6 MB | ######4 | 64%  2025-05-07T20:25:38.4273357Z libcublas-12.8.3.14 | 460.2 MB | ###6 | 36% 2025-05-07T20:25:38.4273914Z 2025-05-07T20:25:38.5275696Z nsight-compute-2025. | 320.6 MB | ######6 | 66%  2025-05-07T20:25:38.5279599Z 2025-05-07T20:25:38.5314948Z nsight-compute-2025. | 320.6 MB | ######8 | 68%  2025-05-07T20:25:38.6339676Z libcublas-12.8.3.14 | 460.2 MB | ###6 | 37% 2025-05-07T20:25:38.6342351Z 2025-05-07T20:25:38.6382670Z nsight-compute-2025. | 320.6 MB | ####### | 70%  2025-05-07T20:25:38.7385253Z libcublas-12.8.3.14 | 460.2 MB | ###7 | 38% 2025-05-07T20:25:38.7722575Z libcublas-12.8.3.14 | 460.2 MB | ###8 | 39% 2025-05-07T20:25:38.7725620Z 2025-05-07T20:25:38.8385875Z nsight-compute-2025. | 320.6 MB | #######1 | 72%  2025-05-07T20:25:38.9028313Z libcublas-12.8.3.14 | 460.2 MB | ###9 | 39% 2025-05-07T20:25:38.9033966Z 2025-05-07T20:25:38.9391942Z nsight-compute-2025. | 320.6 MB | #######3 | 74%  2025-05-07T20:25:39.0194237Z libcublas-12.8.3.14 | 460.2 MB | #### | 40% 2025-05-07T20:25:39.0196719Z 2025-05-07T20:25:39.0393280Z nsight-compute-2025. | 320.6 MB | #######5 | 75%  2025-05-07T20:25:39.1273858Z libcublas-12.8.3.14 | 460.2 MB | ####1 | 41% 2025-05-07T20:25:39.1274259Z 2025-05-07T20:25:39.1396315Z nsight-compute-2025. | 320.6 MB | #######6 | 77%  2025-05-07T20:25:39.2276120Z libcublas-12.8.3.14 | 460.2 MB | ####2 | 42% 2025-05-07T20:25:39.2277246Z 2025-05-07T20:25:39.2486352Z nsight-compute-2025. | 320.6 MB | #######8 | 78%  2025-05-07T20:25:39.3462994Z libcublas-12.8.3.14 | 460.2 MB | ####2 | 43% 2025-05-07T20:25:39.3463679Z 2025-05-07T20:25:39.3490732Z nsight-compute-2025. | 320.6 MB | #######9 | 80%  2025-05-07T20:25:39.4493777Z libcublas-12.8.3.14 | 460.2 MB | ####3 | 44% 2025-05-07T20:25:39.4582700Z libcublas-12.8.3.14 | 460.2 MB | ####4 | 45% 2025-05-07T20:25:39.4584812Z 2025-05-07T20:25:39.5516235Z nsight-compute-2025. | 320.6 MB | ######## | 81%  2025-05-07T20:25:39.5664804Z libcublas-12.8.3.14 | 460.2 MB | ####5 | 46% 2025-05-07T20:25:39.5665657Z 2025-05-07T20:25:39.6525601Z nsight-compute-2025. | 320.6 MB | ########2 | 82%  2025-05-07T20:25:39.6712413Z libcublas-12.8.3.14 | 460.2 MB | ####6 | 46% 2025-05-07T20:25:39.6713087Z 2025-05-07T20:25:39.6713190Z 2025-05-07T20:25:39.6713198Z 2025-05-07T20:25:39.6718019Z 2025-05-07T20:25:39.6785113Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:25:39.6785826Z 2025-05-07T20:25:39.7328130Z nsight-compute-2025. | 320.6 MB | ########3 | 84%  2025-05-07T20:25:39.7328567Z 2025-05-07T20:25:39.7328574Z 2025-05-07T20:25:39.7328580Z 2025-05-07T20:25:39.7328587Z 2025-05-07T20:25:39.7333267Z 2025-05-07T20:25:39.7579167Z libnpp-12.3.3.65 | 130.6 MB | | 0%  2025-05-07T20:25:39.8003899Z libcublas-12.8.3.14 | 460.2 MB | ####7 | 47% 2025-05-07T20:25:39.8008392Z 2025-05-07T20:25:39.8331502Z nsight-compute-2025. | 320.6 MB | ########4 | 85%  2025-05-07T20:25:39.8331812Z 2025-05-07T20:25:39.8331816Z 2025-05-07T20:25:39.8331820Z 2025-05-07T20:25:39.8331824Z 2025-05-07T20:25:39.8331828Z 2025-05-07T20:25:39.8711545Z libnpp-12.3.3.65 | 130.6 MB | 2 | 3%  2025-05-07T20:25:39.9225928Z libcublas-12.8.3.14 | 460.2 MB | ####8 | 48% 2025-05-07T20:25:39.9227568Z 2025-05-07T20:25:39.9359581Z nsight-compute-2025. 
| 320.6 MB | ########6 | 86%  2025-05-07T20:25:39.9359958Z 2025-05-07T20:25:39.9359965Z 2025-05-07T20:25:39.9359971Z 2025-05-07T20:25:39.9359977Z 2025-05-07T20:25:39.9359983Z 2025-05-07T20:25:39.9773523Z libnpp-12.3.3.65 | 130.6 MB | 5 | 5%  2025-05-07T20:25:40.0340226Z libcublas-12.8.3.14 | 460.2 MB | ####8 | 49% 2025-05-07T20:25:40.0340627Z 2025-05-07T20:25:40.0340659Z 2025-05-07T20:25:40.0340663Z 2025-05-07T20:25:40.0340667Z 2025-05-07T20:25:40.0340671Z 2025-05-07T20:25:40.0393942Z libnpp-12.3.3.65 | 130.6 MB | 7 | 8%  2025-05-07T20:25:40.0398363Z 2025-05-07T20:25:40.0890930Z nsight-compute-2025. | 320.6 MB | ########7 | 87%  2025-05-07T20:25:40.1341055Z libcublas-12.8.3.14 | 460.2 MB | ####9 | 50% 2025-05-07T20:25:40.1341373Z 2025-05-07T20:25:40.1341381Z 2025-05-07T20:25:40.1341620Z 2025-05-07T20:25:40.1341628Z 2025-05-07T20:25:40.1343289Z 2025-05-07T20:25:40.1394053Z libnpp-12.3.3.65 | 130.6 MB | # | 10%  2025-05-07T20:25:40.1395933Z 2025-05-07T20:25:40.1513243Z nsight-compute-2025. | 320.6 MB | ########8 | 88%  2025-05-07T20:25:40.1513771Z 2025-05-07T20:25:40.1515503Z 2025-05-07T20:25:40.1853374Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:25:40.1853790Z 2025-05-07T20:25:40.1853796Z 2025-05-07T20:25:40.1853802Z 2025-05-07T20:25:40.1854161Z 2025-05-07T20:25:40.1854167Z 2025-05-07T20:25:40.1857120Z 2025-05-07T20:25:40.2209537Z cuda-nsight-12.8.55 | 113.2 MB | | 0%  2025-05-07T20:25:40.2447216Z libcublas-12.8.3.14 | 460.2 MB | ##### | 50% 2025-05-07T20:25:40.2447617Z 2025-05-07T20:25:40.2447623Z 2025-05-07T20:25:40.2447634Z 2025-05-07T20:25:40.2447640Z 2025-05-07T20:25:40.2447645Z 2025-05-07T20:25:40.2688920Z libnpp-12.3.3.65 | 130.6 MB | #2 | 12%  2025-05-07T20:25:40.2692021Z 2025-05-07T20:25:40.2859292Z nsight-compute-2025. | 320.6 MB | ########9 | 90%  2025-05-07T20:25:40.2859708Z 2025-05-07T20:25:40.2859714Z 2025-05-07T20:25:40.2859719Z 2025-05-07T20:25:40.2859725Z 2025-05-07T20:25:40.2859730Z 2025-05-07T20:25:40.2862831Z 2025-05-07T20:25:40.3342629Z cuda-nsight-12.8.55 | 113.2 MB | 2 | 2%  2025-05-07T20:25:40.3559678Z libcublas-12.8.3.14 | 460.2 MB | #####1 | 51% 2025-05-07T20:25:40.3560090Z 2025-05-07T20:25:40.3560118Z 2025-05-07T20:25:40.3560124Z 2025-05-07T20:25:40.3560130Z 2025-05-07T20:25:40.3560135Z 2025-05-07T20:25:40.3869818Z libnpp-12.3.3.65 | 130.6 MB | #4 | 15%  2025-05-07T20:25:40.3870249Z 2025-05-07T20:25:40.3870255Z 2025-05-07T20:25:40.3870260Z 2025-05-07T20:25:40.3870266Z 2025-05-07T20:25:40.3870271Z 2025-05-07T20:25:40.3870281Z 2025-05-07T20:25:40.3953846Z cuda-nsight-12.8.55 | 113.2 MB | 5 | 5%  2025-05-07T20:25:40.3954298Z 2025-05-07T20:25:40.4466721Z nsight-compute-2025. | 320.6 MB | ######### | 91%  2025-05-07T20:25:40.4682335Z libcublas-12.8.3.14 | 460.2 MB | #####1 | 52% 2025-05-07T20:25:40.4682731Z 2025-05-07T20:25:40.4682737Z 2025-05-07T20:25:40.4682826Z 2025-05-07T20:25:40.4682834Z 2025-05-07T20:25:40.4687112Z 2025-05-07T20:25:40.4871492Z libnpp-12.3.3.65 | 130.6 MB | #6 | 17%  2025-05-07T20:25:40.4871913Z 2025-05-07T20:25:40.4871919Z 2025-05-07T20:25:40.4871946Z 2025-05-07T20:25:40.4871951Z 2025-05-07T20:25:40.4871957Z 2025-05-07T20:25:40.4877289Z 2025-05-07T20:25:40.5234260Z cuda-nsight-12.8.55 | 113.2 MB | 7 | 7%  2025-05-07T20:25:40.5234714Z 2025-05-07T20:25:40.5615956Z nsight-compute-2025. 
| 320.6 MB | #########1 | 92%  2025-05-07T20:25:40.5776477Z libcublas-12.8.3.14 | 460.2 MB | #####2 | 52% 2025-05-07T20:25:40.5776895Z 2025-05-07T20:25:40.5776901Z 2025-05-07T20:25:40.5776907Z 2025-05-07T20:25:40.5776913Z 2025-05-07T20:25:40.5776919Z 2025-05-07T20:25:40.5879711Z libnpp-12.3.3.65 | 130.6 MB | #8 | 19%  2025-05-07T20:25:40.5880131Z 2025-05-07T20:25:40.5880137Z 2025-05-07T20:25:40.5880143Z 2025-05-07T20:25:40.5880148Z 2025-05-07T20:25:40.5880154Z 2025-05-07T20:25:40.5882784Z 2025-05-07T20:25:40.6355921Z cuda-nsight-12.8.55 | 113.2 MB | # | 10%  2025-05-07T20:25:40.6356375Z 2025-05-07T20:25:40.6757724Z nsight-compute-2025. | 320.6 MB | #########2 | 92%  2025-05-07T20:25:40.6780402Z libcublas-12.8.3.14 | 460.2 MB | #####3 | 53% 2025-05-07T20:25:40.6781188Z 2025-05-07T20:25:40.6781251Z 2025-05-07T20:25:40.6781273Z 2025-05-07T20:25:40.6781279Z 2025-05-07T20:25:40.6781285Z 2025-05-07T20:25:40.6882342Z libnpp-12.3.3.65 | 130.6 MB | ##1 | 21%  2025-05-07T20:25:40.6882781Z 2025-05-07T20:25:40.6882787Z 2025-05-07T20:25:40.6882792Z 2025-05-07T20:25:40.6882798Z 2025-05-07T20:25:40.6882804Z 2025-05-07T20:25:40.6883901Z 2025-05-07T20:25:40.7489228Z cuda-nsight-12.8.55 | 113.2 MB | #2 | 13%  2025-05-07T20:25:40.7490822Z 2025-05-07T20:25:40.7761519Z nsight-compute-2025. | 320.6 MB | #########3 | 93%  2025-05-07T20:25:40.7781858Z libcublas-12.8.3.14 | 460.2 MB | #####3 | 54% 2025-05-07T20:25:40.7782251Z 2025-05-07T20:25:40.7782350Z 2025-05-07T20:25:40.7782382Z 2025-05-07T20:25:40.7782388Z 2025-05-07T20:25:40.7782424Z 2025-05-07T20:25:40.7882589Z libnpp-12.3.3.65 | 130.6 MB | ##3 | 23%  2025-05-07T20:25:40.7883278Z 2025-05-07T20:25:40.7883283Z 2025-05-07T20:25:40.7883289Z 2025-05-07T20:25:40.7883294Z 2025-05-07T20:25:40.7883478Z 2025-05-07T20:25:40.7884285Z 2025-05-07T20:25:40.8609030Z cuda-nsight-12.8.55 | 113.2 MB | #5 | 15%  2025-05-07T20:25:40.8610253Z 2025-05-07T20:25:40.8818064Z nsight-compute-2025. 
| 320.6 MB | #########4 | 94%  2025-05-07T20:25:40.8818496Z 2025-05-07T20:25:40.8818502Z 2025-05-07T20:25:40.8818508Z 2025-05-07T20:25:40.8818513Z 2025-05-07T20:25:40.8818523Z 2025-05-07T20:25:40.8887370Z libnpp-12.3.3.65 | 130.6 MB | ##5 | 26%  2025-05-07T20:25:40.8887809Z 2025-05-07T20:25:40.8887815Z 2025-05-07T20:25:40.8887820Z 2025-05-07T20:25:40.8887826Z 2025-05-07T20:25:40.8887832Z 2025-05-07T20:25:40.8891043Z 2025-05-07T20:25:40.8911344Z cuda-nsight-12.8.55 | 113.2 MB | #8 | 18%  2025-05-07T20:25:40.9166563Z libcublas-12.8.3.14 | 460.2 MB | #####4 | 54% 2025-05-07T20:25:40.9166987Z 2025-05-07T20:25:40.9166994Z 2025-05-07T20:25:40.9175359Z 2025-05-07T20:25:40.9658487Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:25:40.9658924Z 2025-05-07T20:25:40.9658930Z 2025-05-07T20:25:40.9658936Z 2025-05-07T20:25:40.9658947Z 2025-05-07T20:25:40.9658952Z 2025-05-07T20:25:40.9658958Z 2025-05-07T20:25:40.9660360Z 2025-05-07T20:25:40.9951576Z cuda-nvvp-12.8.57 | 112.4 MB | | 0%  2025-05-07T20:25:40.9952543Z libcublas-12.8.3.14 | 460.2 MB | #####4 | 55% 2025-05-07T20:25:40.9952905Z 2025-05-07T20:25:40.9952912Z 2025-05-07T20:25:40.9952917Z 2025-05-07T20:25:40.9952934Z 2025-05-07T20:25:40.9953062Z 2025-05-07T20:25:41.0042523Z libnpp-12.3.3.65 | 130.6 MB | ##7 | 28%  2025-05-07T20:25:41.0042930Z 2025-05-07T20:25:41.0042944Z 2025-05-07T20:25:41.0042958Z 2025-05-07T20:25:41.0042964Z 2025-05-07T20:25:41.0042969Z 2025-05-07T20:25:41.0042995Z 2025-05-07T20:25:41.0664907Z cuda-nsight-12.8.55 | 113.2 MB | ## | 21%  2025-05-07T20:25:41.0665365Z 2025-05-07T20:25:41.0665371Z 2025-05-07T20:25:41.0665394Z 2025-05-07T20:25:41.0665401Z 2025-05-07T20:25:41.0665406Z 2025-05-07T20:25:41.0665412Z 2025-05-07T20:25:41.0668977Z 2025-05-07T20:25:41.0683634Z cuda-nvvp-12.8.57 | 112.4 MB | 1 | 2%  2025-05-07T20:25:41.0684073Z 2025-05-07T20:25:41.1149498Z nsight-compute-2025. | 320.6 MB | #########5 | 95%  2025-05-07T20:25:41.1249826Z libcublas-12.8.3.14 | 460.2 MB | #####5 | 55% 2025-05-07T20:25:41.1250224Z 2025-05-07T20:25:41.1250230Z 2025-05-07T20:25:41.1250236Z 2025-05-07T20:25:41.1250251Z 2025-05-07T20:25:41.1251887Z 2025-05-07T20:25:41.1354384Z libnpp-12.3.3.65 | 130.6 MB | ##9 | 30%  2025-05-07T20:25:41.1354815Z 2025-05-07T20:25:41.1354821Z 2025-05-07T20:25:41.1354835Z 2025-05-07T20:25:41.1354842Z 2025-05-07T20:25:41.1354874Z 2025-05-07T20:25:41.1354879Z 2025-05-07T20:25:41.1670153Z cuda-nsight-12.8.55 | 113.2 MB | ##3 | 23%  2025-05-07T20:25:41.1670595Z 2025-05-07T20:25:41.1670627Z 2025-05-07T20:25:41.1670633Z 2025-05-07T20:25:41.1670639Z 2025-05-07T20:25:41.1670644Z 2025-05-07T20:25:41.1670649Z 2025-05-07T20:25:41.1672640Z 2025-05-07T20:25:41.1696910Z cuda-nvvp-12.8.57 | 112.4 MB | 3 | 4%  2025-05-07T20:25:41.1697348Z 2025-05-07T20:25:41.2466817Z nsight-compute-2025. 
| 320.6 MB | #########5 | 96%  2025-05-07T20:25:41.2467222Z 2025-05-07T20:25:41.2467228Z 2025-05-07T20:25:41.2467233Z 2025-05-07T20:25:41.2467248Z 2025-05-07T20:25:41.2467254Z 2025-05-07T20:25:41.2467264Z 2025-05-07T20:25:41.2504161Z cuda-nsight-12.8.55 | 113.2 MB | ##5 | 25%  2025-05-07T20:25:41.2526713Z libcublas-12.8.3.14 | 460.2 MB | #####6 | 56% 2025-05-07T20:25:41.2527122Z 2025-05-07T20:25:41.2527129Z 2025-05-07T20:25:41.2527135Z 2025-05-07T20:25:41.2527721Z 2025-05-07T20:25:41.2527727Z 2025-05-07T20:25:41.2682589Z libnpp-12.3.3.65 | 130.6 MB | ###1 | 32%  2025-05-07T20:25:41.2683013Z 2025-05-07T20:25:41.2683261Z 2025-05-07T20:25:41.2683268Z 2025-05-07T20:25:41.2683274Z 2025-05-07T20:25:41.2683279Z 2025-05-07T20:25:41.2683285Z 2025-05-07T20:25:41.2684340Z 2025-05-07T20:25:41.2698383Z cuda-nvvp-12.8.57 | 112.4 MB | 5 | 6%  2025-05-07T20:25:41.2700598Z 2025-05-07T20:25:41.3524812Z nsight-compute-2025. | 320.6 MB | #########6 | 96%  2025-05-07T20:25:41.3525466Z 2025-05-07T20:25:41.3525472Z 2025-05-07T20:25:41.3525478Z 2025-05-07T20:25:41.3525483Z 2025-05-07T20:25:41.3525489Z 2025-05-07T20:25:41.3528577Z 2025-05-07T20:25:41.3547063Z cuda-nsight-12.8.55 | 113.2 MB | ##7 | 28%  2025-05-07T20:25:41.3682747Z libcublas-12.8.3.14 | 460.2 MB | #####6 | 57% 2025-05-07T20:25:41.3683156Z 2025-05-07T20:25:41.3683163Z 2025-05-07T20:25:41.3683191Z 2025-05-07T20:25:41.3683197Z 2025-05-07T20:25:41.3683202Z 2025-05-07T20:25:41.3683208Z 2025-05-07T20:25:41.3685311Z 2025-05-07T20:25:41.3745139Z cuda-nvvp-12.8.57 | 112.4 MB | 7 | 7%  2025-05-07T20:25:41.3745586Z 2025-05-07T20:25:41.3745593Z 2025-05-07T20:25:41.3745598Z 2025-05-07T20:25:41.3745604Z 2025-05-07T20:25:41.3745609Z 2025-05-07T20:25:41.3852979Z libnpp-12.3.3.65 | 130.6 MB | ###3 | 33%  2025-05-07T20:25:41.3853396Z 2025-05-07T20:25:41.4580721Z nsight-compute-2025. | 320.6 MB | #########7 | 97%  2025-05-07T20:25:41.4581368Z 2025-05-07T20:25:41.4581374Z 2025-05-07T20:25:41.4581380Z 2025-05-07T20:25:41.4581386Z 2025-05-07T20:25:41.4581391Z 2025-05-07T20:25:41.4585405Z 2025-05-07T20:25:41.4719125Z cuda-nsight-12.8.55 | 113.2 MB | ##9 | 30%  2025-05-07T20:25:41.4719578Z 2025-05-07T20:25:41.4719584Z 2025-05-07T20:25:41.4719590Z 2025-05-07T20:25:41.4719595Z 2025-05-07T20:25:41.4719624Z 2025-05-07T20:25:41.4719630Z 2025-05-07T20:25:41.4719636Z 2025-05-07T20:25:41.4782185Z cuda-nvvp-12.8.57 | 112.4 MB | 8 | 9%  2025-05-07T20:25:41.4848192Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 57% 2025-05-07T20:25:41.4848598Z 2025-05-07T20:25:41.4848715Z 2025-05-07T20:25:41.4848723Z 2025-05-07T20:25:41.4848730Z 2025-05-07T20:25:41.4848758Z 2025-05-07T20:25:41.4991429Z libnpp-12.3.3.65 | 130.6 MB | ###5 | 35%  2025-05-07T20:25:41.4993865Z 2025-05-07T20:25:41.5663569Z nsight-compute-2025. 
| 320.6 MB | #########7 | 98%  2025-05-07T20:25:41.5663975Z 2025-05-07T20:25:41.5663982Z 2025-05-07T20:25:41.5663999Z 2025-05-07T20:25:41.5664005Z 2025-05-07T20:25:41.5664011Z 2025-05-07T20:25:41.5665406Z 2025-05-07T20:25:41.5727224Z cuda-nsight-12.8.55 | 113.2 MB | ###1 | 32%  2025-05-07T20:25:41.5728284Z 2025-05-07T20:25:41.5728290Z 2025-05-07T20:25:41.5728296Z 2025-05-07T20:25:41.5728301Z 2025-05-07T20:25:41.5728327Z 2025-05-07T20:25:41.5728333Z 2025-05-07T20:25:41.5730639Z 2025-05-07T20:25:41.5807486Z cuda-nvvp-12.8.57 | 112.4 MB | # | 11%  2025-05-07T20:25:41.5870419Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 58% 2025-05-07T20:25:41.5870809Z 2025-05-07T20:25:41.5870815Z 2025-05-07T20:25:41.5870825Z 2025-05-07T20:25:41.5870841Z 2025-05-07T20:25:41.5870847Z 2025-05-07T20:25:41.6052626Z libnpp-12.3.3.65 | 130.6 MB | ###6 | 37%  2025-05-07T20:25:41.6053040Z 2025-05-07T20:25:41.6701988Z nsight-compute-2025. | 320.6 MB | #########8 | 98%  2025-05-07T20:25:41.6702413Z 2025-05-07T20:25:41.6702419Z 2025-05-07T20:25:41.6702425Z 2025-05-07T20:25:41.6702430Z 2025-05-07T20:25:41.6702436Z 2025-05-07T20:25:41.6707776Z 2025-05-07T20:25:41.6736198Z cuda-nsight-12.8.55 | 113.2 MB | ###3 | 34%  2025-05-07T20:25:41.6737117Z 2025-05-07T20:25:41.6737123Z 2025-05-07T20:25:41.6737129Z 2025-05-07T20:25:41.6737863Z 2025-05-07T20:25:41.6737867Z 2025-05-07T20:25:41.6737882Z 2025-05-07T20:25:41.6737886Z 2025-05-07T20:25:41.6808362Z cuda-nvvp-12.8.57 | 112.4 MB | #2 | 12%  2025-05-07T20:25:41.6965192Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 58% 2025-05-07T20:25:41.6965851Z 2025-05-07T20:25:41.6965857Z 2025-05-07T20:25:41.6965863Z 2025-05-07T20:25:41.6965869Z 2025-05-07T20:25:41.6976609Z 2025-05-07T20:25:41.7059098Z libnpp-12.3.3.65 | 130.6 MB | ###8 | 38%  2025-05-07T20:25:41.7059514Z 2025-05-07T20:25:41.7702572Z nsight-compute-2025. | 320.6 MB | #########9 | 99%  2025-05-07T20:25:41.7703004Z 2025-05-07T20:25:41.7703011Z 2025-05-07T20:25:41.7703017Z 2025-05-07T20:25:41.7703024Z 2025-05-07T20:25:41.7703030Z 2025-05-07T20:25:41.7706972Z 2025-05-07T20:25:41.7745255Z cuda-nsight-12.8.55 | 113.2 MB | ###5 | 36%  2025-05-07T20:25:41.7745708Z 2025-05-07T20:25:41.7745715Z 2025-05-07T20:25:41.7745740Z 2025-05-07T20:25:41.7745745Z 2025-05-07T20:25:41.7745751Z 2025-05-07T20:25:41.7745756Z 2025-05-07T20:25:41.7752213Z 2025-05-07T20:25:41.7999834Z cuda-nvvp-12.8.57 | 112.4 MB | #4 | 14%  2025-05-07T20:25:41.8061164Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 58% 2025-05-07T20:25:41.8065148Z 2025-05-07T20:25:41.8073851Z nsight-compute-2025. 
| 320.6 MB | #########9 | 100%  2025-05-07T20:25:41.8074253Z 2025-05-07T20:25:41.8074260Z 2025-05-07T20:25:41.8074265Z 2025-05-07T20:25:41.8074271Z 2025-05-07T20:25:41.8074276Z 2025-05-07T20:25:41.8704524Z libnpp-12.3.3.65 | 130.6 MB | #### | 40%  2025-05-07T20:25:41.8704949Z 2025-05-07T20:25:41.8704954Z 2025-05-07T20:25:41.8704960Z 2025-05-07T20:25:41.8704966Z 2025-05-07T20:25:41.8704972Z 2025-05-07T20:25:41.8707381Z 2025-05-07T20:25:41.8749528Z cuda-nsight-12.8.55 | 113.2 MB | ###8 | 38%  2025-05-07T20:25:41.8749967Z 2025-05-07T20:25:41.8749973Z 2025-05-07T20:25:41.8750000Z 2025-05-07T20:25:41.8750004Z 2025-05-07T20:25:41.8750008Z 2025-05-07T20:25:41.8750012Z 2025-05-07T20:25:41.8755303Z 2025-05-07T20:25:41.9045238Z cuda-nvvp-12.8.57 | 112.4 MB | #6 | 16%  2025-05-07T20:25:41.9104481Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 59% 2025-05-07T20:25:41.9104867Z 2025-05-07T20:25:41.9104873Z 2025-05-07T20:25:41.9104878Z 2025-05-07T20:25:41.9104884Z 2025-05-07T20:25:41.9107026Z 2025-05-07T20:25:41.9711031Z libnpp-12.3.3.65 | 130.6 MB | ####1 | 42%  2025-05-07T20:25:41.9711467Z 2025-05-07T20:25:41.9711473Z 2025-05-07T20:25:41.9711478Z 2025-05-07T20:25:41.9711483Z 2025-05-07T20:25:41.9711490Z 2025-05-07T20:25:41.9711496Z 2025-05-07T20:25:41.9751963Z cuda-nsight-12.8.55 | 113.2 MB | #### | 40%  2025-05-07T20:25:41.9752409Z 2025-05-07T20:25:41.9752415Z 2025-05-07T20:25:41.9752420Z 2025-05-07T20:25:41.9752425Z 2025-05-07T20:25:41.9752431Z 2025-05-07T20:25:41.9752456Z 2025-05-07T20:25:41.9752461Z 2025-05-07T20:25:42.0055044Z cuda-nvvp-12.8.57 | 112.4 MB | #8 | 18%  2025-05-07T20:25:42.0104871Z libcublas-12.8.3.14 | 460.2 MB | #####9 | 59% 2025-05-07T20:25:42.0105246Z 2025-05-07T20:25:42.0105252Z 2025-05-07T20:25:42.0105257Z 2025-05-07T20:25:42.0105262Z 2025-05-07T20:25:42.0110964Z 2025-05-07T20:25:42.0713133Z libnpp-12.3.3.65 | 130.6 MB | ####3 | 43%  2025-05-07T20:25:42.0713610Z 2025-05-07T20:25:42.0713621Z 2025-05-07T20:25:42.0713625Z 2025-05-07T20:25:42.0713629Z 2025-05-07T20:25:42.0713633Z 2025-05-07T20:25:42.0713637Z 2025-05-07T20:25:42.0752402Z cuda-nsight-12.8.55 | 113.2 MB | ####2 | 43%  2025-05-07T20:25:42.0752870Z 2025-05-07T20:25:42.0753015Z 2025-05-07T20:25:42.0753020Z 2025-05-07T20:25:42.0753024Z 2025-05-07T20:25:42.0753027Z 2025-05-07T20:25:42.0753031Z 2025-05-07T20:25:42.0753035Z 2025-05-07T20:25:42.1056255Z cuda-nvvp-12.8.57 | 112.4 MB | ## | 20%  2025-05-07T20:25:42.1107334Z libcublas-12.8.3.14 | 460.2 MB | #####9 | 60% 2025-05-07T20:25:42.1107601Z 2025-05-07T20:25:42.1107800Z 2025-05-07T20:25:42.1107806Z 2025-05-07T20:25:42.1107810Z 2025-05-07T20:25:42.1115311Z 2025-05-07T20:25:42.1805249Z libnpp-12.3.3.65 | 130.6 MB | ####5 | 45%  2025-05-07T20:25:42.1805564Z 2025-05-07T20:25:42.1805568Z 2025-05-07T20:25:42.1805572Z 2025-05-07T20:25:42.1805576Z 2025-05-07T20:25:42.1805580Z 2025-05-07T20:25:42.1807332Z 2025-05-07T20:25:42.1816141Z cuda-nsight-12.8.55 | 113.2 MB | ####5 | 45%  2025-05-07T20:25:42.1816574Z 2025-05-07T20:25:42.1816578Z 2025-05-07T20:25:42.1816582Z 2025-05-07T20:25:42.1816586Z 2025-05-07T20:25:42.1816590Z 2025-05-07T20:25:42.1816594Z 2025-05-07T20:25:42.1821117Z 2025-05-07T20:25:42.2070320Z cuda-nvvp-12.8.57 | 112.4 MB | ##2 | 22%  2025-05-07T20:25:42.2110105Z libcublas-12.8.3.14 | 460.2 MB | ###### | 60% 2025-05-07T20:25:42.2110503Z 2025-05-07T20:25:42.2110509Z 2025-05-07T20:25:42.2110516Z 2025-05-07T20:25:42.2110521Z 2025-05-07T20:25:42.2113090Z 2025-05-07T20:25:42.2819498Z libnpp-12.3.3.65 | 130.6 MB | ####7 | 47%  2025-05-07T20:25:42.2819923Z 
2025-05-07T20:25:42.2819929Z 2025-05-07T20:25:42.2819934Z 2025-05-07T20:25:42.2819940Z 2025-05-07T20:25:42.2819946Z 2025-05-07T20:25:42.2819951Z 2025-05-07T20:25:42.2822245Z 2025-05-07T20:25:42.2889599Z cuda-nvvp-12.8.57 | 112.4 MB | ##3 | 24%  2025-05-07T20:25:42.2890255Z 2025-05-07T20:25:42.2890259Z 2025-05-07T20:25:42.2890263Z 2025-05-07T20:25:42.2890276Z 2025-05-07T20:25:42.2890280Z 2025-05-07T20:25:42.2898360Z 2025-05-07T20:25:42.3070999Z cuda-nsight-12.8.55 | 113.2 MB | ####7 | 47%  2025-05-07T20:25:42.3114350Z libcublas-12.8.3.14 | 460.2 MB | ###### | 61% 2025-05-07T20:25:42.3114742Z 2025-05-07T20:25:42.3114749Z 2025-05-07T20:25:42.3114777Z 2025-05-07T20:25:42.3114783Z 2025-05-07T20:25:42.3117842Z 2025-05-07T20:25:42.3822959Z libnpp-12.3.3.65 | 130.6 MB | ####9 | 49%  2025-05-07T20:25:42.3823300Z 2025-05-07T20:25:42.3823305Z 2025-05-07T20:25:42.3823309Z 2025-05-07T20:25:42.3823313Z 2025-05-07T20:25:42.3823317Z 2025-05-07T20:25:42.3823320Z 2025-05-07T20:25:42.3823639Z 2025-05-07T20:25:42.3952885Z cuda-nvvp-12.8.57 | 112.4 MB | ##5 | 26%  2025-05-07T20:25:42.3953203Z 2025-05-07T20:25:42.3953208Z 2025-05-07T20:25:42.3953212Z 2025-05-07T20:25:42.3953216Z 2025-05-07T20:25:42.3953231Z 2025-05-07T20:25:42.3953235Z 2025-05-07T20:25:42.4076577Z cuda-nsight-12.8.55 | 113.2 MB | ####9 | 49%  2025-05-07T20:25:42.4116677Z libcublas-12.8.3.14 | 460.2 MB | ######1 | 61% 2025-05-07T20:25:42.4116945Z 2025-05-07T20:25:42.4116953Z 2025-05-07T20:25:42.4116957Z 2025-05-07T20:25:42.4117147Z 2025-05-07T20:25:42.4119665Z 2025-05-07T20:25:42.4832803Z libnpp-12.3.3.65 | 130.6 MB | ##### | 51%  2025-05-07T20:25:42.4833243Z 2025-05-07T20:25:42.4833250Z 2025-05-07T20:25:42.4833255Z 2025-05-07T20:25:42.4833282Z 2025-05-07T20:25:42.4833288Z 2025-05-07T20:25:42.4833293Z 2025-05-07T20:25:42.4840494Z 2025-05-07T20:25:42.5030841Z cuda-nvvp-12.8.57 | 112.4 MB | ##7 | 28%  2025-05-07T20:25:42.5031211Z 2025-05-07T20:25:42.5031215Z 2025-05-07T20:25:42.5031219Z 2025-05-07T20:25:42.5031223Z 2025-05-07T20:25:42.5031227Z 2025-05-07T20:25:42.5031231Z 2025-05-07T20:25:42.5077401Z cuda-nsight-12.8.55 | 113.2 MB | #####1 | 51%  2025-05-07T20:25:42.5155064Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 62% 2025-05-07T20:25:42.5155332Z 2025-05-07T20:25:42.5155336Z 2025-05-07T20:25:42.5155340Z 2025-05-07T20:25:42.5155344Z 2025-05-07T20:25:42.5161401Z 2025-05-07T20:25:42.5928655Z libnpp-12.3.3.65 | 130.6 MB | #####2 | 53%  2025-05-07T20:25:42.5929255Z 2025-05-07T20:25:42.5929260Z 2025-05-07T20:25:42.5929263Z 2025-05-07T20:25:42.5929267Z 2025-05-07T20:25:42.5929271Z 2025-05-07T20:25:42.5929275Z 2025-05-07T20:25:42.5929412Z 2025-05-07T20:25:42.6036013Z cuda-nvvp-12.8.57 | 112.4 MB | ##9 | 30%  2025-05-07T20:25:42.6036312Z 2025-05-07T20:25:42.6036316Z 2025-05-07T20:25:42.6036320Z 2025-05-07T20:25:42.6036324Z 2025-05-07T20:25:42.6036328Z 2025-05-07T20:25:42.6038496Z 2025-05-07T20:25:42.6080165Z cuda-nsight-12.8.55 | 113.2 MB | #####3 | 54%  2025-05-07T20:25:42.6159805Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 63% 2025-05-07T20:25:42.6160189Z 2025-05-07T20:25:42.6160195Z 2025-05-07T20:25:42.6160201Z 2025-05-07T20:25:42.6160206Z 2025-05-07T20:25:42.6160211Z 2025-05-07T20:25:42.6934654Z libnpp-12.3.3.65 | 130.6 MB | #####4 | 55%  2025-05-07T20:25:42.6935048Z 2025-05-07T20:25:42.6935052Z 2025-05-07T20:25:42.6935056Z 2025-05-07T20:25:42.6935077Z 2025-05-07T20:25:42.6935081Z 2025-05-07T20:25:42.6935085Z 2025-05-07T20:25:42.6935089Z 2025-05-07T20:25:42.7038107Z cuda-nvvp-12.8.57 | 112.4 MB | ###2 | 32%  2025-05-07T20:25:42.7038492Z 
2025-05-07T20:25:42.7038498Z 2025-05-07T20:25:42.7038504Z 2025-05-07T20:25:42.7038509Z 2025-05-07T20:25:42.7038514Z 2025-05-07T20:25:42.7039867Z 2025-05-07T20:25:42.7101210Z cuda-nsight-12.8.55 | 113.2 MB | #####5 | 56%  2025-05-07T20:25:42.7188992Z libcublas-12.8.3.14 | 460.2 MB | ######3 | 63% 2025-05-07T20:25:42.7189355Z 2025-05-07T20:25:42.7189488Z 2025-05-07T20:25:42.7189496Z 2025-05-07T20:25:42.7189501Z 2025-05-07T20:25:42.7191148Z 2025-05-07T20:25:42.7938820Z libnpp-12.3.3.65 | 130.6 MB | #####6 | 57%  2025-05-07T20:25:42.7939182Z 2025-05-07T20:25:42.7939186Z 2025-05-07T20:25:42.7939190Z 2025-05-07T20:25:42.7939194Z 2025-05-07T20:25:42.7939197Z 2025-05-07T20:25:42.7939202Z 2025-05-07T20:25:42.7939225Z 2025-05-07T20:25:42.8039584Z cuda-nvvp-12.8.57 | 112.4 MB | ###4 | 34%  2025-05-07T20:25:42.8040003Z 2025-05-07T20:25:42.8040008Z 2025-05-07T20:25:42.8040023Z 2025-05-07T20:25:42.8040027Z 2025-05-07T20:25:42.8040031Z 2025-05-07T20:25:42.8040034Z 2025-05-07T20:25:42.8101587Z cuda-nsight-12.8.55 | 113.2 MB | #####7 | 58%  2025-05-07T20:25:42.8194167Z libcublas-12.8.3.14 | 460.2 MB | ######3 | 64% 2025-05-07T20:25:42.8194511Z 2025-05-07T20:25:42.8194517Z 2025-05-07T20:25:42.8194523Z 2025-05-07T20:25:42.8194528Z 2025-05-07T20:25:42.8194534Z 2025-05-07T20:25:42.9053956Z libnpp-12.3.3.65 | 130.6 MB | #####8 | 59%  2025-05-07T20:25:42.9054332Z 2025-05-07T20:25:42.9054336Z 2025-05-07T20:25:42.9054340Z 2025-05-07T20:25:42.9054344Z 2025-05-07T20:25:42.9054348Z 2025-05-07T20:25:42.9054352Z 2025-05-07T20:25:42.9109872Z cuda-nsight-12.8.55 | 113.2 MB | ###### | 61%  2025-05-07T20:25:42.9197263Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 64% 2025-05-07T20:25:42.9197564Z 2025-05-07T20:25:42.9197761Z 2025-05-07T20:25:42.9197768Z 2025-05-07T20:25:42.9197774Z 2025-05-07T20:25:42.9199478Z 2025-05-07T20:25:43.0058035Z libnpp-12.3.3.65 | 130.6 MB | ###### | 61%  2025-05-07T20:25:43.0058476Z 2025-05-07T20:25:43.0058483Z 2025-05-07T20:25:43.0058488Z 2025-05-07T20:25:43.0058504Z 2025-05-07T20:25:43.0058510Z 2025-05-07T20:25:43.0060197Z 2025-05-07T20:25:43.0115647Z cuda-nsight-12.8.55 | 113.2 MB | ######2 | 63%  2025-05-07T20:25:43.0191009Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 65% 2025-05-07T20:25:43.0191276Z 2025-05-07T20:25:43.0191281Z 2025-05-07T20:25:43.0191294Z 2025-05-07T20:25:43.0191298Z 2025-05-07T20:25:43.0191303Z 2025-05-07T20:25:43.0191306Z 2025-05-07T20:25:43.0191310Z 2025-05-07T20:25:43.0199334Z cuda-nvvp-12.8.57 | 112.4 MB | ###6 | 37%  2025-05-07T20:25:43.0199637Z 2025-05-07T20:25:43.0199969Z 2025-05-07T20:25:43.0199975Z 2025-05-07T20:25:43.0199980Z 2025-05-07T20:25:43.0204227Z 2025-05-07T20:25:43.1059699Z libnpp-12.3.3.65 | 130.6 MB | ######2 | 63%  2025-05-07T20:25:43.1060027Z 2025-05-07T20:25:43.1060031Z 2025-05-07T20:25:43.1060035Z 2025-05-07T20:25:43.1060039Z 2025-05-07T20:25:43.1060043Z 2025-05-07T20:25:43.1060697Z 2025-05-07T20:25:43.1192428Z cuda-nsight-12.8.55 | 113.2 MB | ######5 | 65%  2025-05-07T20:25:43.1192760Z 2025-05-07T20:25:43.1192766Z 2025-05-07T20:25:43.1192771Z 2025-05-07T20:25:43.1192776Z 2025-05-07T20:25:43.1192782Z 2025-05-07T20:25:43.1192787Z 2025-05-07T20:25:43.1192792Z 2025-05-07T20:25:43.1208232Z cuda-nvvp-12.8.57 | 112.4 MB | ###9 | 39%  2025-05-07T20:25:43.1208643Z 2025-05-07T20:25:43.1208650Z 2025-05-07T20:25:43.1208655Z 2025-05-07T20:25:43.1208660Z 2025-05-07T20:25:43.1208666Z 2025-05-07T20:25:43.1272766Z libnpp-12.3.3.65 | 130.6 MB | ######4 | 65%  2025-05-07T20:25:43.2065599Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 65% 2025-05-07T20:25:43.2065909Z 
2025-05-07T20:25:43.2065925Z 2025-05-07T20:25:43.2065951Z 2025-05-07T20:25:43.2065957Z 2025-05-07T20:25:43.2065962Z 2025-05-07T20:25:43.2069354Z 2025-05-07T20:25:43.2193300Z cuda-nsight-12.8.55 | 113.2 MB | ######7 | 68%  2025-05-07T20:25:43.2193834Z 2025-05-07T20:25:43.2193841Z 2025-05-07T20:25:43.2193846Z 2025-05-07T20:25:43.2193851Z 2025-05-07T20:25:43.2193857Z 2025-05-07T20:25:43.2193862Z 2025-05-07T20:25:43.2196352Z 2025-05-07T20:25:43.2210258Z cuda-nvvp-12.8.57 | 112.4 MB | ####1 | 42%  2025-05-07T20:25:43.2210634Z 2025-05-07T20:25:43.2210639Z 2025-05-07T20:25:43.2210645Z 2025-05-07T20:25:43.2210650Z 2025-05-07T20:25:43.2212565Z 2025-05-07T20:25:43.3068249Z libnpp-12.3.3.65 | 130.6 MB | ######7 | 67%  2025-05-07T20:25:43.3068662Z 2025-05-07T20:25:43.3068668Z 2025-05-07T20:25:43.3068691Z 2025-05-07T20:25:43.3068696Z 2025-05-07T20:25:43.3068701Z 2025-05-07T20:25:43.3075355Z 2025-05-07T20:25:43.3196649Z cuda-nsight-12.8.55 | 113.2 MB | ####### | 70%  2025-05-07T20:25:43.3197107Z 2025-05-07T20:25:43.3197113Z 2025-05-07T20:25:43.3197119Z 2025-05-07T20:25:43.3197135Z 2025-05-07T20:25:43.3197140Z 2025-05-07T20:25:43.3197145Z 2025-05-07T20:25:43.3198456Z 2025-05-07T20:25:43.3212647Z cuda-nvvp-12.8.57 | 112.4 MB | ####4 | 45%  2025-05-07T20:25:43.3212962Z 2025-05-07T20:25:43.3212966Z 2025-05-07T20:25:43.3212970Z 2025-05-07T20:25:43.3212974Z 2025-05-07T20:25:43.3212978Z 2025-05-07T20:25:43.3704996Z libnpp-12.3.3.65 | 130.6 MB | ######9 | 69%  2025-05-07T20:25:43.4071145Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 66% 2025-05-07T20:25:43.4071417Z 2025-05-07T20:25:43.4071421Z 2025-05-07T20:25:43.4071425Z 2025-05-07T20:25:43.4071429Z 2025-05-07T20:25:43.4071433Z 2025-05-07T20:25:43.4074396Z 2025-05-07T20:25:43.4200947Z cuda-nsight-12.8.55 | 113.2 MB | #######3 | 74%  2025-05-07T20:25:43.4201349Z 2025-05-07T20:25:43.4201353Z 2025-05-07T20:25:43.4201369Z 2025-05-07T20:25:43.4201374Z 2025-05-07T20:25:43.4201378Z 2025-05-07T20:25:43.4201390Z 2025-05-07T20:25:43.4201394Z 2025-05-07T20:25:43.4245160Z cuda-nvvp-12.8.57 | 112.4 MB | ####7 | 47%  2025-05-07T20:25:43.4245457Z 2025-05-07T20:25:43.4245462Z 2025-05-07T20:25:43.4245466Z 2025-05-07T20:25:43.4245477Z 2025-05-07T20:25:43.4245481Z 2025-05-07T20:25:43.4776649Z libnpp-12.3.3.65 | 130.6 MB | #######1 | 71%  2025-05-07T20:25:43.5071310Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 66% 2025-05-07T20:25:43.5071582Z 2025-05-07T20:25:43.5071586Z 2025-05-07T20:25:43.5071590Z 2025-05-07T20:25:43.5071594Z 2025-05-07T20:25:43.5071599Z 2025-05-07T20:25:43.5073829Z 2025-05-07T20:25:43.5203069Z cuda-nsight-12.8.55 | 113.2 MB | #######6 | 76%  2025-05-07T20:25:43.5203646Z 2025-05-07T20:25:43.5203650Z 2025-05-07T20:25:43.5203654Z 2025-05-07T20:25:43.5203658Z 2025-05-07T20:25:43.5203662Z 2025-05-07T20:25:43.5203816Z 2025-05-07T20:25:43.5204129Z 2025-05-07T20:25:43.5247728Z cuda-nvvp-12.8.57 | 112.4 MB | ####9 | 50%  2025-05-07T20:25:43.5248087Z 2025-05-07T20:25:43.5248092Z 2025-05-07T20:25:43.5248098Z 2025-05-07T20:25:43.5248103Z 2025-05-07T20:25:43.5250232Z 2025-05-07T20:25:43.5948190Z libnpp-12.3.3.65 | 130.6 MB | #######3 | 74%  2025-05-07T20:25:43.6204790Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 67% 2025-05-07T20:25:43.6205178Z 2025-05-07T20:25:43.6205182Z 2025-05-07T20:25:43.6205186Z 2025-05-07T20:25:43.6205190Z 2025-05-07T20:25:43.6205194Z 2025-05-07T20:25:43.6205198Z 2025-05-07T20:25:43.6207089Z 2025-05-07T20:25:43.6284572Z cuda-nvvp-12.8.57 | 112.4 MB | #####2 | 52%  2025-05-07T20:25:43.6284877Z 2025-05-07T20:25:43.6284898Z 2025-05-07T20:25:43.6284902Z 
2025-05-07T20:25:48.8394912Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%
2025-05-07T20:25:48.9631890Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%
2025-05-07T20:25:49.9751632Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%
2025-05-07T20:25:49.9777745Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%
2025-05-07T20:25:52.6911392Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%
2025-05-07T20:25:53.1145165Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%
2025-05-07T20:25:53.3527776Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%
2025-05-07T20:25:53.4330352Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%
2025-05-07T20:25:53.6853133Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%
2025-05-07T20:25:54.0557593Z nsight-compute-2025. | 320.6 MB | ########## | 100%
2025-05-07T20:25:54.8853473Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%
2025-05-07T20:25:55.0618083Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%
2025-05-07T20:25:55.3127542Z python-3.10.13 | 24.5 MB | ########## | 100%
2025-05-07T20:25:55.8022800Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%
2025-05-07T20:25:56.0327317Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%
2025-05-07T20:25:56.0474617Z ... (more hidden) ...
2025-05-07T20:25:56.1333264Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%
2025-05-07T20:25:56.2959601Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%
2025-05-07T20:25:58.5953721Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%
2025-05-07T20:25:59.1447832Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100%
2025-05-07T20:26:07.6899787Z nsight-compute-2025.
| 320.6 MB | ########## | 100%  2025-05-07T20:26:07.6909331Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100% 2025-05-07T20:26:07.6909712Z 2025-05-07T20:26:07.6909718Z 2025-05-07T20:26:07.6909725Z 2025-05-07T20:26:07.6909731Z 2025-05-07T20:26:07.6909738Z 2025-05-07T20:26:07.6909744Z 2025-05-07T20:26:07.6909750Z 2025-05-07T20:26:07.6909756Z 2025-05-07T20:26:07.6909765Z 2025-05-07T20:26:07.6909772Z 2025-05-07T20:26:07.6909778Z 2025-05-07T20:26:07.6909785Z 2025-05-07T20:26:07.6909789Z 2025-05-07T20:26:07.6909801Z 2025-05-07T20:26:07.6909805Z 2025-05-07T20:26:07.6910115Z 2025-05-07T20:26:07.6910119Z 2025-05-07T20:26:07.6910123Z 2025-05-07T20:26:07.6910127Z 2025-05-07T20:26:07.6910234Z 2025-05-07T20:26:07.6910857Z  2025-05-07T20:26:07.6911226Z 2025-05-07T20:26:07.6911453Z 2025-05-07T20:26:07.6911659Z  2025-05-07T20:26:07.6911888Z 2025-05-07T20:26:07.6911892Z 2025-05-07T20:26:07.6912072Z  2025-05-07T20:26:07.6912299Z 2025-05-07T20:26:07.6912303Z 2025-05-07T20:26:07.6912314Z 2025-05-07T20:26:07.6912563Z  2025-05-07T20:26:07.6912885Z 2025-05-07T20:26:07.6912890Z 2025-05-07T20:26:07.6912894Z 2025-05-07T20:26:07.6912898Z 2025-05-07T20:26:07.6913406Z  2025-05-07T20:26:07.6913793Z 2025-05-07T20:26:07.6913810Z 2025-05-07T20:26:07.6913819Z 2025-05-07T20:26:07.6913823Z 2025-05-07T20:26:07.6913828Z 2025-05-07T20:26:07.6914341Z  2025-05-07T20:26:07.6914603Z 2025-05-07T20:26:07.6914619Z 2025-05-07T20:26:07.6914625Z 2025-05-07T20:26:07.6914629Z 2025-05-07T20:26:07.6914633Z 2025-05-07T20:26:07.6914637Z 2025-05-07T20:26:07.6915222Z  2025-05-07T20:26:07.6915510Z 2025-05-07T20:26:07.6915516Z 2025-05-07T20:26:07.6915529Z 2025-05-07T20:26:07.6915535Z 2025-05-07T20:26:07.6915540Z 2025-05-07T20:26:07.6915546Z 2025-05-07T20:26:07.6915552Z 2025-05-07T20:26:07.6916360Z  2025-05-07T20:26:07.6916789Z 2025-05-07T20:26:07.6916797Z 2025-05-07T20:26:07.6916804Z 2025-05-07T20:26:07.6916810Z 2025-05-07T20:26:07.6916816Z 2025-05-07T20:26:07.6916822Z 2025-05-07T20:26:07.6916839Z 2025-05-07T20:26:07.6916866Z 2025-05-07T20:26:07.6917268Z  2025-05-07T20:26:07.6917683Z 2025-05-07T20:26:07.6917698Z 2025-05-07T20:26:07.6917704Z 2025-05-07T20:26:07.6917720Z 2025-05-07T20:26:07.6917726Z 2025-05-07T20:26:07.6917732Z 2025-05-07T20:26:07.6917737Z 2025-05-07T20:26:07.6917743Z 2025-05-07T20:26:07.6917749Z 2025-05-07T20:26:07.6918250Z  2025-05-07T20:26:07.6918599Z 2025-05-07T20:26:07.6918605Z 2025-05-07T20:26:07.6918620Z 2025-05-07T20:26:07.6918626Z 2025-05-07T20:26:07.6918632Z 2025-05-07T20:26:07.6918637Z 2025-05-07T20:26:07.6918643Z 2025-05-07T20:26:07.6918648Z 2025-05-07T20:26:07.6918654Z 2025-05-07T20:26:07.6918664Z 2025-05-07T20:26:07.6919137Z  2025-05-07T20:26:07.6919565Z 2025-05-07T20:26:07.6919571Z 2025-05-07T20:26:07.6919577Z 2025-05-07T20:26:07.6919595Z 2025-05-07T20:26:07.6919601Z 2025-05-07T20:26:07.6919608Z 2025-05-07T20:26:07.6919614Z 2025-05-07T20:26:07.6919620Z 2025-05-07T20:26:07.6919626Z 2025-05-07T20:26:07.6919639Z 2025-05-07T20:26:07.6919656Z 2025-05-07T20:26:07.6920023Z  2025-05-07T20:26:07.6920451Z 2025-05-07T20:26:07.6920457Z 2025-05-07T20:26:07.6920463Z 2025-05-07T20:26:07.6920468Z 2025-05-07T20:26:07.6920473Z 2025-05-07T20:26:07.6920486Z 2025-05-07T20:26:07.6920492Z 2025-05-07T20:26:07.6920498Z 2025-05-07T20:26:07.6920504Z 2025-05-07T20:26:07.6920509Z 2025-05-07T20:26:07.6920515Z 2025-05-07T20:26:07.6920521Z 2025-05-07T20:26:07.6921158Z  2025-05-07T20:26:07.6921563Z 2025-05-07T20:26:07.6921570Z 2025-05-07T20:26:07.6921577Z 2025-05-07T20:26:07.6921582Z 2025-05-07T20:26:07.6921588Z 2025-05-07T20:26:07.6921594Z 
2025-05-07T20:26:07.6921768Z 2025-05-07T20:26:07.6921775Z 2025-05-07T20:26:07.6921782Z 2025-05-07T20:26:07.6921800Z 2025-05-07T20:26:07.6921806Z 2025-05-07T20:26:07.6921813Z 2025-05-07T20:26:07.6921915Z 2025-05-07T20:26:07.6922300Z  2025-05-07T20:26:07.6922717Z 2025-05-07T20:26:07.6922724Z 2025-05-07T20:26:07.6922731Z 2025-05-07T20:26:07.6922737Z 2025-05-07T20:26:07.6922743Z 2025-05-07T20:26:07.6922750Z 2025-05-07T20:26:07.6922756Z 2025-05-07T20:26:07.6922762Z 2025-05-07T20:26:07.6922769Z 2025-05-07T20:26:07.6922775Z 2025-05-07T20:26:07.6922781Z 2025-05-07T20:26:07.6922788Z 2025-05-07T20:26:07.6922794Z 2025-05-07T20:26:07.6922801Z 2025-05-07T20:26:07.6923187Z  2025-05-07T20:26:07.6923602Z 2025-05-07T20:26:07.6923609Z 2025-05-07T20:26:07.6923616Z 2025-05-07T20:26:07.6923622Z 2025-05-07T20:26:07.6923628Z 2025-05-07T20:26:07.6923645Z 2025-05-07T20:26:07.6923652Z 2025-05-07T20:26:07.6923658Z 2025-05-07T20:26:07.6923664Z 2025-05-07T20:26:07.6923671Z 2025-05-07T20:26:07.6923677Z 2025-05-07T20:26:07.6923692Z 2025-05-07T20:26:07.6923698Z 2025-05-07T20:26:07.6923705Z 2025-05-07T20:26:07.6923711Z 2025-05-07T20:26:07.6924357Z  2025-05-07T20:26:07.6924766Z 2025-05-07T20:26:07.6924773Z 2025-05-07T20:26:07.6924779Z 2025-05-07T20:26:07.6924785Z 2025-05-07T20:26:07.6924801Z 2025-05-07T20:26:07.6924807Z 2025-05-07T20:26:07.6924814Z 2025-05-07T20:26:07.6924820Z 2025-05-07T20:26:07.6924826Z 2025-05-07T20:26:07.6924833Z 2025-05-07T20:26:07.6924839Z 2025-05-07T20:26:07.6924845Z 2025-05-07T20:26:07.6924852Z 2025-05-07T20:26:07.6924858Z 2025-05-07T20:26:07.6924864Z 2025-05-07T20:26:07.6924871Z 2025-05-07T20:26:07.6925305Z  2025-05-07T20:26:07.6925752Z 2025-05-07T20:26:07.6925761Z 2025-05-07T20:26:07.6925769Z 2025-05-07T20:26:07.6925794Z 2025-05-07T20:26:07.6925800Z 2025-05-07T20:26:07.6925806Z 2025-05-07T20:26:07.6925821Z 2025-05-07T20:26:07.6925827Z 2025-05-07T20:26:07.6925834Z 2025-05-07T20:26:07.6925840Z 2025-05-07T20:26:07.6925846Z 2025-05-07T20:26:07.6925854Z 2025-05-07T20:26:07.6925862Z 2025-05-07T20:26:07.6925881Z 2025-05-07T20:26:07.6925889Z 2025-05-07T20:26:07.6925897Z 2025-05-07T20:26:07.6925905Z 2025-05-07T20:26:07.6926316Z  2025-05-07T20:26:07.6926739Z 2025-05-07T20:26:07.6926746Z 2025-05-07T20:26:07.6926752Z 2025-05-07T20:26:07.6926767Z 2025-05-07T20:26:07.6926773Z 2025-05-07T20:26:07.6926779Z 2025-05-07T20:26:07.6926786Z 2025-05-07T20:26:07.6926792Z 2025-05-07T20:26:07.6926799Z 2025-05-07T20:26:07.6926805Z 2025-05-07T20:26:07.6926811Z 2025-05-07T20:26:07.6926817Z 2025-05-07T20:26:07.6926832Z 2025-05-07T20:26:07.6926838Z 2025-05-07T20:26:07.6926845Z 2025-05-07T20:26:07.6926851Z 2025-05-07T20:26:07.6926857Z 2025-05-07T20:26:07.6926863Z 2025-05-07T20:26:07.6927994Z  2025-05-07T20:26:07.6928410Z 2025-05-07T20:26:07.6928422Z 2025-05-07T20:26:07.6928625Z  2025-05-07T20:26:07.6928829Z 2025-05-07T20:26:07.6928839Z 2025-05-07T20:26:07.6929617Z  2025-05-07T20:26:07.6929834Z 2025-05-07T20:26:07.6929840Z 2025-05-07T20:26:07.6929851Z 2025-05-07T20:26:07.6930367Z  2025-05-07T20:26:07.6930577Z 2025-05-07T20:26:07.6930584Z 2025-05-07T20:26:07.6930599Z 2025-05-07T20:26:07.6930610Z 2025-05-07T20:26:07.6931396Z  2025-05-07T20:26:07.6931616Z 2025-05-07T20:26:07.6931623Z 2025-05-07T20:26:07.6931629Z 2025-05-07T20:26:07.6931635Z 2025-05-07T20:26:07.6931654Z 2025-05-07T20:26:07.6932195Z  2025-05-07T20:26:07.6932413Z 2025-05-07T20:26:07.6932672Z 2025-05-07T20:26:07.6932678Z 2025-05-07T20:26:07.6932684Z 2025-05-07T20:26:07.6932694Z 2025-05-07T20:26:07.6932700Z 2025-05-07T20:26:07.6933030Z  2025-05-07T20:26:07.6934121Z 
2025-05-07T20:26:07.6934134Z 2025-05-07T20:26:07.6934140Z 2025-05-07T20:26:07.6934147Z 2025-05-07T20:26:07.6934153Z 2025-05-07T20:26:07.6934171Z 2025-05-07T20:26:07.6934188Z 2025-05-07T20:26:07.6934422Z  2025-05-07T20:26:07.6934686Z 2025-05-07T20:26:07.6934692Z 2025-05-07T20:26:07.6934699Z 2025-05-07T20:26:07.6934705Z 2025-05-07T20:26:07.6934720Z 2025-05-07T20:26:07.6934726Z 2025-05-07T20:26:07.6934733Z 2025-05-07T20:26:07.6934739Z 2025-05-07T20:26:07.6934973Z  2025-05-07T20:26:07.6935242Z 2025-05-07T20:26:07.6935248Z 2025-05-07T20:26:07.6935262Z 2025-05-07T20:26:07.6935268Z 2025-05-07T20:26:07.6935274Z 2025-05-07T20:26:07.6935279Z 2025-05-07T20:26:07.6935285Z 2025-05-07T20:26:07.6935291Z 2025-05-07T20:26:07.6935296Z 2025-05-07T20:26:07.6935659Z  2025-05-07T20:26:07.6935968Z 2025-05-07T20:26:07.6935974Z 2025-05-07T20:26:07.6935980Z 2025-05-07T20:26:07.6935986Z 2025-05-07T20:26:07.6935993Z 2025-05-07T20:26:07.6936012Z 2025-05-07T20:26:07.6936019Z 2025-05-07T20:26:07.6936025Z 2025-05-07T20:26:07.6936031Z 2025-05-07T20:26:07.6936041Z 2025-05-07T20:26:07.6936654Z  2025-05-07T20:26:07.6936866Z 2025-05-07T20:26:07.6936871Z 2025-05-07T20:26:07.6936875Z 2025-05-07T20:26:07.6936880Z 2025-05-07T20:26:07.6936891Z 2025-05-07T20:26:07.6936902Z 2025-05-07T20:26:07.6936906Z 2025-05-07T20:26:07.6936910Z 2025-05-07T20:26:07.6936914Z 2025-05-07T20:26:07.6936918Z 2025-05-07T20:26:07.6936921Z 2025-05-07T20:26:07.6937354Z  2025-05-07T20:26:07.6937599Z 2025-05-07T20:26:07.6937604Z 2025-05-07T20:26:07.6937608Z 2025-05-07T20:26:07.6937612Z 2025-05-07T20:26:07.6937616Z 2025-05-07T20:26:07.6937620Z 2025-05-07T20:26:07.6937624Z 2025-05-07T20:26:07.6937641Z 2025-05-07T20:26:07.6937653Z 2025-05-07T20:26:07.6937657Z 2025-05-07T20:26:07.6937660Z 2025-05-07T20:26:07.6937664Z 2025-05-07T20:26:07.6938093Z  2025-05-07T20:26:07.6938320Z 2025-05-07T20:26:07.6938324Z 2025-05-07T20:26:07.6938328Z 2025-05-07T20:26:07.6938338Z 2025-05-07T20:26:07.6938343Z 2025-05-07T20:26:07.6938347Z 2025-05-07T20:26:07.6938351Z 2025-05-07T20:26:07.6938355Z 2025-05-07T20:26:07.6938359Z 2025-05-07T20:26:07.6938363Z 2025-05-07T20:26:07.6938367Z 2025-05-07T20:26:07.6938371Z 2025-05-07T20:26:07.6938375Z 2025-05-07T20:26:07.6938834Z  2025-05-07T20:26:07.6939092Z 2025-05-07T20:26:07.6939103Z 2025-05-07T20:26:07.6939115Z 2025-05-07T20:26:07.6939119Z 2025-05-07T20:26:07.6939123Z 2025-05-07T20:26:07.6939127Z 2025-05-07T20:26:07.6939131Z 2025-05-07T20:26:07.6939134Z 2025-05-07T20:26:07.6939139Z 2025-05-07T20:26:07.6939142Z 2025-05-07T20:26:07.6939146Z 2025-05-07T20:26:07.6939150Z 2025-05-07T20:26:07.6939162Z 2025-05-07T20:26:07.6939166Z 2025-05-07T20:26:07.6939582Z  2025-05-07T20:26:07.6939831Z 2025-05-07T20:26:07.6939835Z 2025-05-07T20:26:07.6939846Z 2025-05-07T20:26:07.6939856Z 2025-05-07T20:26:07.6939860Z 2025-05-07T20:26:07.6939864Z 2025-05-07T20:26:07.6939868Z 2025-05-07T20:26:07.6939872Z 2025-05-07T20:26:07.6939876Z 2025-05-07T20:26:07.6939880Z 2025-05-07T20:26:07.6939884Z 2025-05-07T20:26:07.6939889Z 2025-05-07T20:26:07.6939892Z 2025-05-07T20:26:07.6939897Z 2025-05-07T20:26:07.6939910Z 2025-05-07T20:26:07.6940430Z  2025-05-07T20:26:07.6940701Z 2025-05-07T20:26:07.6940705Z 2025-05-07T20:26:07.6940710Z 2025-05-07T20:26:07.6940721Z 2025-05-07T20:26:07.6940725Z 2025-05-07T20:26:07.6940729Z 2025-05-07T20:26:07.6940733Z 2025-05-07T20:26:07.6940737Z 2025-05-07T20:26:07.6940741Z 2025-05-07T20:26:07.6940745Z 2025-05-07T20:26:07.6940749Z 2025-05-07T20:26:07.6940753Z 2025-05-07T20:26:07.6941113Z 2025-05-07T20:26:07.6941119Z 2025-05-07T20:26:07.6941125Z 
2025-05-07T20:26:07.6941130Z 2025-05-07T20:26:07.6941398Z  2025-05-07T20:26:07.6941818Z 2025-05-07T20:26:07.6941837Z 2025-05-07T20:26:07.6941844Z 2025-05-07T20:26:07.6941849Z 2025-05-07T20:26:07.6941855Z 2025-05-07T20:26:07.6941860Z 2025-05-07T20:26:07.6941866Z 2025-05-07T20:26:07.6941872Z 2025-05-07T20:26:07.6941877Z 2025-05-07T20:26:07.6941883Z 2025-05-07T20:26:07.6941889Z 2025-05-07T20:26:07.6941894Z 2025-05-07T20:26:07.6941900Z 2025-05-07T20:26:07.6941905Z 2025-05-07T20:26:07.6941912Z 2025-05-07T20:26:07.6941917Z 2025-05-07T20:26:07.6941923Z 2025-05-07T20:26:07.6942222Z  2025-05-07T20:26:07.6942558Z 2025-05-07T20:26:07.6942564Z 2025-05-07T20:26:07.6942570Z 2025-05-07T20:26:07.6942576Z 2025-05-07T20:26:07.6942581Z 2025-05-07T20:26:07.6942587Z 2025-05-07T20:26:07.6942592Z 2025-05-07T20:26:07.6942598Z 2025-05-07T20:26:07.6942614Z 2025-05-07T20:26:07.6942619Z 2025-05-07T20:26:07.6942625Z 2025-05-07T20:26:07.6942630Z 2025-05-07T20:26:07.6942646Z 2025-05-07T20:26:07.6942652Z 2025-05-07T20:26:07.6942665Z 2025-05-07T20:26:07.6942670Z 2025-05-07T20:26:07.6942676Z 2025-05-07T20:26:07.6942681Z 2025-05-07T20:26:07.6943351Z  2025-05-07T20:26:07.6943694Z 2025-05-07T20:26:07.6943700Z 2025-05-07T20:26:07.6943861Z  2025-05-07T20:26:07.6944029Z 2025-05-07T20:26:07.6944035Z 2025-05-07T20:26:07.6944427Z  2025-05-07T20:26:07.6944597Z 2025-05-07T20:26:07.6944603Z 2025-05-07T20:26:07.6944608Z 2025-05-07T20:26:07.6945226Z  2025-05-07T20:26:07.6945414Z 2025-05-07T20:26:07.6945420Z 2025-05-07T20:26:07.6945426Z 2025-05-07T20:26:07.6945431Z 2025-05-07T20:26:07.6945678Z  2025-05-07T20:26:07.6945894Z 2025-05-07T20:26:07.6945901Z 2025-05-07T20:26:07.6945913Z 2025-05-07T20:26:07.6945920Z 2025-05-07T20:26:07.6945940Z 2025-05-07T20:26:07.6946438Z  2025-05-07T20:26:07.6946651Z 2025-05-07T20:26:07.6946657Z 2025-05-07T20:26:07.6946663Z 2025-05-07T20:26:07.6946673Z 2025-05-07T20:26:07.6946679Z 2025-05-07T20:26:07.6946690Z 2025-05-07T20:26:07.6947103Z  2025-05-07T20:26:07.6947304Z 2025-05-07T20:26:07.6947316Z 2025-05-07T20:26:07.6947322Z 2025-05-07T20:26:07.6947327Z 2025-05-07T20:26:07.6947333Z 2025-05-07T20:26:07.6947338Z 2025-05-07T20:26:07.6947352Z 2025-05-07T20:26:07.6947841Z  2025-05-07T20:26:07.6948068Z 2025-05-07T20:26:07.6948074Z 2025-05-07T20:26:07.6948080Z 2025-05-07T20:26:07.6948086Z 2025-05-07T20:26:07.6948091Z 2025-05-07T20:26:07.6948097Z 2025-05-07T20:26:07.6948102Z 2025-05-07T20:26:07.6948111Z 2025-05-07T20:26:07.6948506Z  2025-05-07T20:26:07.6948734Z 2025-05-07T20:26:07.6948746Z 2025-05-07T20:26:07.6948750Z 2025-05-07T20:26:07.6948754Z 2025-05-07T20:26:07.6948758Z 2025-05-07T20:26:07.6948769Z 2025-05-07T20:26:07.6948773Z 2025-05-07T20:26:07.6948786Z 2025-05-07T20:26:07.6948790Z 2025-05-07T20:26:07.6949331Z  2025-05-07T20:26:07.6949636Z 2025-05-07T20:26:07.6949643Z 2025-05-07T20:26:07.6949659Z 2025-05-07T20:26:07.6949666Z 2025-05-07T20:26:07.6949672Z 2025-05-07T20:26:07.6949677Z 2025-05-07T20:26:07.6949683Z 2025-05-07T20:26:07.6949689Z 2025-05-07T20:26:07.6949695Z 2025-05-07T20:26:07.6949706Z 2025-05-07T20:26:07.6950000Z  2025-05-07T20:26:07.6950259Z 2025-05-07T20:26:07.6950265Z 2025-05-07T20:26:07.6950270Z 2025-05-07T20:26:07.6950283Z 2025-05-07T20:26:07.6950289Z 2025-05-07T20:26:07.6950294Z 2025-05-07T20:26:07.6950300Z 2025-05-07T20:26:07.6950306Z 2025-05-07T20:26:07.6950311Z 2025-05-07T20:26:07.6950317Z 2025-05-07T20:26:07.6950322Z 2025-05-07T20:26:07.6950780Z  2025-05-07T20:26:07.6951108Z 2025-05-07T20:26:07.6951114Z 2025-05-07T20:26:07.6951120Z 2025-05-07T20:26:07.6951126Z 2025-05-07T20:26:07.6951132Z 
2025-05-07T20:26:07.6951307Z 2025-05-07T20:26:07.6951313Z 2025-05-07T20:26:07.6951320Z 2025-05-07T20:26:07.6951326Z 2025-05-07T20:26:07.6951333Z 2025-05-07T20:26:07.6951339Z 2025-05-07T20:26:07.6951456Z 2025-05-07T20:26:07.6951722Z  2025-05-07T20:26:07.6952045Z 2025-05-07T20:26:07.6952052Z 2025-05-07T20:26:07.6952058Z 2025-05-07T20:26:07.6952064Z 2025-05-07T20:26:07.6952070Z 2025-05-07T20:26:07.6952076Z 2025-05-07T20:26:07.6952093Z 2025-05-07T20:26:07.6952099Z 2025-05-07T20:26:07.6952105Z 2025-05-07T20:26:07.6952112Z 2025-05-07T20:26:07.6952118Z 2025-05-07T20:26:07.6952135Z 2025-05-07T20:26:07.6952142Z 2025-05-07T20:26:07.6952414Z  2025-05-07T20:26:07.6952762Z 2025-05-07T20:26:07.6952769Z 2025-05-07T20:26:07.6952774Z 2025-05-07T20:26:07.6952780Z 2025-05-07T20:26:07.6952786Z 2025-05-07T20:26:07.6952792Z 2025-05-07T20:26:07.6952799Z 2025-05-07T20:26:07.6952805Z 2025-05-07T20:26:07.6952811Z 2025-05-07T20:26:07.6952818Z 2025-05-07T20:26:07.6952837Z 2025-05-07T20:26:07.6952843Z 2025-05-07T20:26:07.6952849Z 2025-05-07T20:26:07.6952855Z 2025-05-07T20:26:07.6953149Z  2025-05-07T20:26:07.6953637Z 2025-05-07T20:26:07.6953644Z 2025-05-07T20:26:07.6953651Z 2025-05-07T20:26:07.6953657Z 2025-05-07T20:26:07.6953663Z 2025-05-07T20:26:07.6953670Z 2025-05-07T20:26:07.6953688Z 2025-05-07T20:26:07.6953694Z 2025-05-07T20:26:07.6953700Z 2025-05-07T20:26:07.6953706Z 2025-05-07T20:26:07.6953712Z 2025-05-07T20:26:07.6953718Z 2025-05-07T20:26:07.6953735Z 2025-05-07T20:26:07.6953742Z 2025-05-07T20:26:07.6953748Z 2025-05-07T20:26:07.6954042Z  2025-05-07T20:26:07.6954394Z 2025-05-07T20:26:07.6954401Z 2025-05-07T20:26:07.6954407Z 2025-05-07T20:26:07.6954412Z 2025-05-07T20:26:07.6954428Z 2025-05-07T20:26:07.6954434Z 2025-05-07T20:26:07.6954441Z 2025-05-07T20:26:07.6954447Z 2025-05-07T20:26:07.6954453Z 2025-05-07T20:26:07.6954471Z 2025-05-07T20:26:07.6954486Z 2025-05-07T20:26:07.6954492Z 2025-05-07T20:26:07.6954499Z 2025-05-07T20:26:07.6954505Z 2025-05-07T20:26:07.6954511Z 2025-05-07T20:26:07.6954518Z 2025-05-07T20:26:07.6954821Z  2025-05-07T20:26:07.6955206Z 2025-05-07T20:26:07.6955212Z 2025-05-07T20:26:07.6955228Z 2025-05-07T20:26:07.6955235Z 2025-05-07T20:26:07.6955241Z 2025-05-07T20:26:07.6955248Z 2025-05-07T20:26:07.6955254Z 2025-05-07T20:26:07.6955260Z 2025-05-07T20:26:07.6955266Z 2025-05-07T20:26:07.6955273Z 2025-05-07T20:26:07.6955280Z 2025-05-07T20:26:07.6955286Z 2025-05-07T20:26:07.6955293Z 2025-05-07T20:26:07.6955299Z 2025-05-07T20:26:07.6955306Z 2025-05-07T20:26:07.6955312Z 2025-05-07T20:26:07.6955318Z 2025-05-07T20:26:07.6955631Z  2025-05-07T20:26:07.6956013Z 2025-05-07T20:26:07.6956020Z 2025-05-07T20:26:07.6956027Z 2025-05-07T20:26:07.6956033Z 2025-05-07T20:26:07.6956040Z 2025-05-07T20:26:07.6956055Z 2025-05-07T20:26:07.6956070Z 2025-05-07T20:26:07.6956076Z 2025-05-07T20:26:07.6956083Z 2025-05-07T20:26:07.6956089Z 2025-05-07T20:26:07.6956096Z 2025-05-07T20:26:07.6956102Z 2025-05-07T20:26:07.6956115Z 2025-05-07T20:26:07.6956122Z 2025-05-07T20:26:07.6956136Z 2025-05-07T20:26:07.6956143Z 2025-05-07T20:26:07.6956149Z 2025-05-07T20:26:07.6956155Z 2025-05-07T20:26:07.6957090Z  2025-05-07T20:26:07.6957480Z 2025-05-07T20:26:07.6957487Z 2025-05-07T20:26:07.6957690Z  2025-05-07T20:26:07.6957877Z 2025-05-07T20:26:07.6957883Z 2025-05-07T20:26:07.6958253Z  2025-05-07T20:26:07.6958455Z 2025-05-07T20:26:07.6958462Z 2025-05-07T20:26:07.6958471Z 2025-05-07T20:26:07.6958980Z  2025-05-07T20:26:07.6959170Z 2025-05-07T20:26:07.6959182Z 2025-05-07T20:26:07.6959188Z 2025-05-07T20:26:07.6959194Z 2025-05-07T20:26:07.6959666Z  
2025-05-07T20:26:07.6959890Z 2025-05-07T20:26:07.6959897Z 2025-05-07T20:26:07.6959903Z 2025-05-07T20:26:07.6960056Z 2025-05-07T20:26:07.6960066Z 2025-05-07T20:26:07.6960336Z  2025-05-07T20:26:07.6960577Z 2025-05-07T20:26:07.6960583Z 2025-05-07T20:26:07.6960589Z 2025-05-07T20:26:07.6960717Z 2025-05-07T20:26:07.6960724Z 2025-05-07T20:26:07.6960736Z 2025-05-07T20:26:07.6961144Z  2025-05-07T20:26:07.6961354Z 2025-05-07T20:26:07.6961358Z 2025-05-07T20:26:07.6961362Z 2025-05-07T20:26:07.6961366Z 2025-05-07T20:26:07.6961370Z 2025-05-07T20:26:07.6961378Z 2025-05-07T20:26:07.6961382Z 2025-05-07T20:26:07.6961912Z  2025-05-07T20:26:07.6962099Z 2025-05-07T20:26:07.6962105Z 2025-05-07T20:26:07.6962109Z 2025-05-07T20:26:07.6962122Z 2025-05-07T20:26:07.6962126Z 2025-05-07T20:26:07.6962130Z 2025-05-07T20:26:07.6962134Z 2025-05-07T20:26:07.6962138Z 2025-05-07T20:26:07.6962544Z  2025-05-07T20:26:07.6962821Z 2025-05-07T20:26:07.6962827Z 2025-05-07T20:26:07.6962833Z 2025-05-07T20:26:07.6962839Z 2025-05-07T20:26:07.6962845Z 2025-05-07T20:26:07.6962869Z 2025-05-07T20:26:07.6962875Z 2025-05-07T20:26:07.6962881Z 2025-05-07T20:26:07.6962888Z 2025-05-07T20:26:07.6963179Z  2025-05-07T20:26:07.6963481Z 2025-05-07T20:26:07.6963487Z 2025-05-07T20:26:07.6963493Z 2025-05-07T20:26:07.6963506Z 2025-05-07T20:26:07.6963512Z 2025-05-07T20:26:07.6963517Z 2025-05-07T20:26:07.6963523Z 2025-05-07T20:26:07.6963529Z 2025-05-07T20:26:07.6963534Z 2025-05-07T20:26:07.6963540Z 2025-05-07T20:26:07.6963925Z  2025-05-07T20:26:07.6964186Z 2025-05-07T20:26:07.6964191Z 2025-05-07T20:26:07.6964197Z 2025-05-07T20:26:07.6964203Z 2025-05-07T20:26:07.6964208Z 2025-05-07T20:26:07.6964213Z 2025-05-07T20:26:07.6964228Z 2025-05-07T20:26:07.6964234Z 2025-05-07T20:26:07.6964240Z 2025-05-07T20:26:07.6964245Z 2025-05-07T20:26:07.6964255Z 2025-05-07T20:26:07.6964492Z  2025-05-07T20:26:07.6964764Z 2025-05-07T20:26:07.6964777Z 2025-05-07T20:26:07.6964783Z 2025-05-07T20:26:07.6964806Z 2025-05-07T20:26:07.6964812Z 2025-05-07T20:26:07.6964817Z 2025-05-07T20:26:07.6964823Z 2025-05-07T20:26:07.6964828Z 2025-05-07T20:26:07.6964834Z 2025-05-07T20:26:07.6964845Z 2025-05-07T20:26:07.6964851Z 2025-05-07T20:26:07.6964856Z 2025-05-07T20:26:07.6965179Z  2025-05-07T20:26:07.6965476Z 2025-05-07T20:26:07.6965483Z 2025-05-07T20:26:07.6965488Z 2025-05-07T20:26:07.6965493Z 2025-05-07T20:26:07.6965507Z 2025-05-07T20:26:07.6965512Z 2025-05-07T20:26:07.6965518Z 2025-05-07T20:26:07.6965523Z 2025-05-07T20:26:07.6965529Z 2025-05-07T20:26:07.6965534Z 2025-05-07T20:26:07.6965540Z 2025-05-07T20:26:07.6965553Z 2025-05-07T20:26:07.6965558Z 2025-05-07T20:26:07.6965772Z  2025-05-07T20:26:07.6966066Z 2025-05-07T20:26:07.6966072Z 2025-05-07T20:26:07.6966077Z 2025-05-07T20:26:07.6966083Z 2025-05-07T20:26:07.6966088Z 2025-05-07T20:26:07.6966094Z 2025-05-07T20:26:07.6966099Z 2025-05-07T20:26:07.6966105Z 2025-05-07T20:26:07.6966129Z 2025-05-07T20:26:07.6966135Z 2025-05-07T20:26:07.6966141Z 2025-05-07T20:26:07.6966146Z 2025-05-07T20:26:07.6966151Z 2025-05-07T20:26:07.6966156Z 2025-05-07T20:26:07.6966382Z  2025-05-07T20:26:07.6966682Z 2025-05-07T20:26:07.6966688Z 2025-05-07T20:26:07.6966693Z 2025-05-07T20:26:07.6966699Z 2025-05-07T20:26:07.6966705Z 2025-05-07T20:26:07.6966710Z 2025-05-07T20:26:07.6966716Z 2025-05-07T20:26:07.6966722Z 2025-05-07T20:26:07.6966728Z 2025-05-07T20:26:07.6966733Z 2025-05-07T20:26:07.6966738Z 2025-05-07T20:26:07.6966743Z 2025-05-07T20:26:07.6966763Z 2025-05-07T20:26:07.6966768Z 2025-05-07T20:26:07.6966774Z 2025-05-07T20:26:07.6966999Z  2025-05-07T20:26:07.6967302Z 
2025-05-07T20:26:07.6967308Z 2025-05-07T20:26:07.6967321Z 2025-05-07T20:26:07.6967326Z 2025-05-07T20:26:07.6967331Z 2025-05-07T20:26:07.6967336Z 2025-05-07T20:26:07.6967341Z 2025-05-07T20:26:07.6967347Z 2025-05-07T20:26:07.6967503Z 2025-05-07T20:26:07.6967509Z 2025-05-07T20:26:07.6967515Z 2025-05-07T20:26:07.6967520Z 2025-05-07T20:26:07.6967526Z 2025-05-07T20:26:07.6967532Z 2025-05-07T20:26:07.6967628Z 2025-05-07T20:26:07.6967634Z 2025-05-07T20:26:07.6967899Z  2025-05-07T20:26:07.6968214Z 2025-05-07T20:26:07.6968220Z 2025-05-07T20:26:07.6968225Z 2025-05-07T20:26:07.6968231Z 2025-05-07T20:26:07.6968237Z 2025-05-07T20:26:07.6968242Z 2025-05-07T20:26:07.6968248Z 2025-05-07T20:26:07.6968253Z 2025-05-07T20:26:07.6968259Z 2025-05-07T20:26:07.6968264Z 2025-05-07T20:26:07.6968270Z 2025-05-07T20:26:07.6968275Z 2025-05-07T20:26:07.6968281Z 2025-05-07T20:26:07.6968286Z 2025-05-07T20:26:07.6968292Z 2025-05-07T20:26:07.6968305Z 2025-05-07T20:26:07.6968310Z 2025-05-07T20:26:07.6968560Z  2025-05-07T20:26:07.6968884Z 2025-05-07T20:26:07.6968887Z 2025-05-07T20:26:07.6968898Z 2025-05-07T20:26:07.6968902Z 2025-05-07T20:26:07.6968913Z 2025-05-07T20:26:07.6968918Z 2025-05-07T20:26:07.6968921Z 2025-05-07T20:26:07.6968926Z 2025-05-07T20:26:07.6968929Z 2025-05-07T20:26:07.6968934Z 2025-05-07T20:26:07.6968943Z 2025-05-07T20:26:07.6968947Z 2025-05-07T20:26:07.6968951Z 2025-05-07T20:26:07.6968955Z 2025-05-07T20:26:07.6968959Z 2025-05-07T20:26:07.6968963Z 2025-05-07T20:26:07.6968967Z 2025-05-07T20:26:07.6968971Z 2025-05-07T20:26:07.6969617Z  2025-05-07T20:26:07.6970014Z 2025-05-07T20:26:07.6970020Z 2025-05-07T20:26:07.6970217Z  2025-05-07T20:26:07.6970420Z 2025-05-07T20:26:07.6970427Z 2025-05-07T20:26:07.6970606Z  2025-05-07T20:26:07.6970802Z 2025-05-07T20:26:07.6970808Z 2025-05-07T20:26:07.6970821Z 2025-05-07T20:26:07.6971024Z  2025-05-07T20:26:07.6971238Z 2025-05-07T20:26:07.6971244Z 2025-05-07T20:26:07.6971251Z 2025-05-07T20:26:07.6971261Z 2025-05-07T20:26:07.6971690Z  2025-05-07T20:26:07.6971851Z 2025-05-07T20:26:07.6971864Z 2025-05-07T20:26:07.6971868Z 2025-05-07T20:26:07.6971872Z 2025-05-07T20:26:07.6971879Z 2025-05-07T20:26:07.6972160Z  2025-05-07T20:26:07.6972377Z 2025-05-07T20:26:07.6972390Z 2025-05-07T20:26:07.6972401Z 2025-05-07T20:26:07.6972407Z 2025-05-07T20:26:07.6972413Z 2025-05-07T20:26:07.6972419Z 2025-05-07T20:26:07.6972821Z  2025-05-07T20:26:07.6973033Z 2025-05-07T20:26:07.6973039Z 2025-05-07T20:26:07.6973050Z 2025-05-07T20:26:07.6973055Z 2025-05-07T20:26:07.6973061Z 2025-05-07T20:26:07.6973066Z 2025-05-07T20:26:07.6973072Z 2025-05-07T20:26:07.6973480Z  2025-05-07T20:26:07.6973718Z 2025-05-07T20:26:07.6973724Z 2025-05-07T20:26:07.6973730Z 2025-05-07T20:26:07.6973746Z 2025-05-07T20:26:07.6973752Z 2025-05-07T20:26:07.6973758Z 2025-05-07T20:26:07.6973764Z 2025-05-07T20:26:07.6973769Z 2025-05-07T20:26:07.6973971Z  2025-05-07T20:26:07.6974217Z 2025-05-07T20:26:07.6974223Z 2025-05-07T20:26:07.6974239Z 2025-05-07T20:26:07.6974244Z 2025-05-07T20:26:07.6974250Z 2025-05-07T20:26:07.6974255Z 2025-05-07T20:26:07.6974261Z 2025-05-07T20:26:07.6974270Z 2025-05-07T20:26:07.6974277Z 2025-05-07T20:26:07.6974491Z  2025-05-07T20:26:07.6974743Z 2025-05-07T20:26:07.6974748Z 2025-05-07T20:26:07.6974754Z 2025-05-07T20:26:07.6974760Z 2025-05-07T20:26:07.6974765Z 2025-05-07T20:26:07.6974771Z 2025-05-07T20:26:07.6974776Z 2025-05-07T20:26:07.6974781Z 2025-05-07T20:26:07.6974787Z 2025-05-07T20:26:07.6974793Z 2025-05-07T20:26:07.6975018Z  2025-05-07T20:26:07.6975273Z 2025-05-07T20:26:07.6975278Z 2025-05-07T20:26:07.6975284Z 
2025-05-07T20:26:07.6975290Z 2025-05-07T20:26:07.6975295Z 2025-05-07T20:26:07.6975305Z 2025-05-07T20:26:07.6975311Z 2025-05-07T20:26:07.6975326Z 2025-05-07T20:26:07.6975332Z 2025-05-07T20:26:07.6975337Z 2025-05-07T20:26:07.6975343Z 2025-05-07T20:26:07.6975559Z  2025-05-07T20:26:07.6975836Z 2025-05-07T20:26:07.6975998Z 2025-05-07T20:26:07.6976004Z 2025-05-07T20:26:07.6976011Z 2025-05-07T20:26:07.6976017Z 2025-05-07T20:26:07.6976023Z 2025-05-07T20:26:07.6976028Z 2025-05-07T20:26:07.6976119Z 2025-05-07T20:26:07.6976125Z 2025-05-07T20:26:07.6976131Z 2025-05-07T20:26:07.6976136Z 2025-05-07T20:26:07.6976142Z 2025-05-07T20:26:07.6976372Z  2025-05-07T20:26:07.6976665Z 2025-05-07T20:26:07.6976671Z 2025-05-07T20:26:07.6976676Z 2025-05-07T20:26:07.6976682Z 2025-05-07T20:26:07.6976687Z 2025-05-07T20:26:07.6976693Z 2025-05-07T20:26:07.6976698Z 2025-05-07T20:26:07.6976703Z 2025-05-07T20:26:07.6976709Z 2025-05-07T20:26:07.6976714Z 2025-05-07T20:26:07.6976720Z 2025-05-07T20:26:07.6976725Z 2025-05-07T20:26:07.6976731Z 2025-05-07T20:26:07.6976955Z  2025-05-07T20:26:07.6977239Z 2025-05-07T20:26:07.6977245Z 2025-05-07T20:26:07.6977250Z 2025-05-07T20:26:07.6977256Z 2025-05-07T20:26:07.6977261Z 2025-05-07T20:26:07.6977277Z 2025-05-07T20:26:07.6977282Z 2025-05-07T20:26:07.6977288Z 2025-05-07T20:26:07.6977293Z 2025-05-07T20:26:07.6977299Z 2025-05-07T20:26:07.6977305Z 2025-05-07T20:26:07.6977326Z 2025-05-07T20:26:07.6977333Z 2025-05-07T20:26:07.6977338Z 2025-05-07T20:26:07.6977565Z  2025-05-07T20:26:07.6977866Z 2025-05-07T20:26:07.6977872Z 2025-05-07T20:26:07.6977877Z 2025-05-07T20:26:07.6977893Z 2025-05-07T20:26:07.6977899Z 2025-05-07T20:26:07.6977904Z 2025-05-07T20:26:07.6977910Z 2025-05-07T20:26:07.6977915Z 2025-05-07T20:26:07.6977921Z 2025-05-07T20:26:07.6977926Z 2025-05-07T20:26:07.6977932Z 2025-05-07T20:26:07.6977937Z 2025-05-07T20:26:07.6977943Z 2025-05-07T20:26:07.6977948Z 2025-05-07T20:26:07.6977954Z 2025-05-07T20:26:07.6978200Z  2025-05-07T20:26:07.6978512Z 2025-05-07T20:26:07.6978518Z 2025-05-07T20:26:07.6978523Z 2025-05-07T20:26:07.6978529Z 2025-05-07T20:26:07.6978534Z 2025-05-07T20:26:07.6978547Z 2025-05-07T20:26:07.6978553Z 2025-05-07T20:26:07.6978558Z 2025-05-07T20:26:07.6978564Z 2025-05-07T20:26:07.6978569Z 2025-05-07T20:26:07.6978575Z 2025-05-07T20:26:07.6978580Z 2025-05-07T20:26:07.6978591Z 2025-05-07T20:26:07.6978597Z 2025-05-07T20:26:07.6978603Z 2025-05-07T20:26:07.6978609Z 2025-05-07T20:26:07.6978847Z  2025-05-07T20:26:07.6979158Z 2025-05-07T20:26:07.6979175Z 2025-05-07T20:26:07.6979181Z 2025-05-07T20:26:07.6979186Z 2025-05-07T20:26:07.6979199Z 2025-05-07T20:26:07.6979205Z 2025-05-07T20:26:07.6979210Z 2025-05-07T20:26:07.6979216Z 2025-05-07T20:26:07.6979221Z 2025-05-07T20:26:07.6979227Z 2025-05-07T20:26:07.6979232Z 2025-05-07T20:26:07.6979237Z 2025-05-07T20:26:07.6979243Z 2025-05-07T20:26:07.6979248Z 2025-05-07T20:26:07.6979254Z 2025-05-07T20:26:07.6979259Z 2025-05-07T20:26:07.6979265Z 2025-05-07T20:26:07.6979501Z  2025-05-07T20:26:07.6979829Z 2025-05-07T20:26:07.6979844Z 2025-05-07T20:26:07.6979849Z 2025-05-07T20:26:07.6979855Z 2025-05-07T20:26:07.6979860Z 2025-05-07T20:26:07.6979866Z 2025-05-07T20:26:07.6979871Z 2025-05-07T20:26:07.6979883Z 2025-05-07T20:26:07.6979889Z 2025-05-07T20:26:07.6979895Z 2025-05-07T20:26:07.6979900Z 2025-05-07T20:26:07.6979905Z 2025-05-07T20:26:07.6979910Z 2025-05-07T20:26:07.6979915Z 2025-05-07T20:26:07.6979921Z 2025-05-07T20:26:07.6979926Z 2025-05-07T20:26:07.6979932Z 2025-05-07T20:26:07.6979937Z 2025-05-07T20:26:07.6980200Z  2025-05-07T20:26:07.6980522Z 
2025-05-07T20:26:07.6980527Z 2025-05-07T20:26:07.6980687Z  2025-05-07T20:26:07.6980850Z 2025-05-07T20:26:07.6980856Z 2025-05-07T20:26:07.6981018Z  2025-05-07T20:26:07.6981193Z 2025-05-07T20:26:07.6981198Z 2025-05-07T20:26:07.6981204Z 2025-05-07T20:26:07.6981363Z  2025-05-07T20:26:07.6981540Z 2025-05-07T20:26:07.6981546Z 2025-05-07T20:26:07.6981551Z 2025-05-07T20:26:07.6981557Z 2025-05-07T20:26:07.6981845Z  2025-05-07T20:26:07.6982036Z 2025-05-07T20:26:07.6982042Z 2025-05-07T20:26:07.6982048Z 2025-05-07T20:26:07.6982054Z 2025-05-07T20:26:07.6982059Z 2025-05-07T20:26:07.6982324Z  2025-05-07T20:26:07.6982519Z 2025-05-07T20:26:07.6982525Z 2025-05-07T20:26:07.6982539Z 2025-05-07T20:26:07.6982545Z 2025-05-07T20:26:07.6982550Z 2025-05-07T20:26:07.6982556Z 2025-05-07T20:26:07.6982735Z  2025-05-07T20:26:07.6982933Z 2025-05-07T20:26:07.6982939Z 2025-05-07T20:26:07.6982952Z 2025-05-07T20:26:07.6982958Z 2025-05-07T20:26:07.6982963Z 2025-05-07T20:26:07.6982969Z 2025-05-07T20:26:07.6982974Z 2025-05-07T20:26:07.6983158Z  2025-05-07T20:26:07.6983371Z 2025-05-07T20:26:07.6983388Z 2025-05-07T20:26:07.6983393Z 2025-05-07T20:26:07.6983399Z 2025-05-07T20:26:07.6983404Z 2025-05-07T20:26:07.6983410Z 2025-05-07T20:26:07.6983415Z 2025-05-07T20:26:07.6983421Z 2025-05-07T20:26:07.6983614Z  2025-05-07T20:26:07.6983868Z 2025-05-07T20:26:07.6983874Z 2025-05-07T20:26:07.6983879Z 2025-05-07T20:26:07.6983885Z 2025-05-07T20:26:07.6983890Z 2025-05-07T20:26:07.6983896Z 2025-05-07T20:26:07.6983909Z 2025-05-07T20:26:07.6983914Z 2025-05-07T20:26:07.6983920Z 2025-05-07T20:26:07.6984126Z  2025-05-07T20:26:07.6984389Z 2025-05-07T20:26:07.6984395Z 2025-05-07T20:26:07.6984401Z 2025-05-07T20:26:07.6984407Z 2025-05-07T20:26:07.6984413Z 2025-05-07T20:26:07.6984418Z 2025-05-07T20:26:07.6984423Z 2025-05-07T20:26:07.6984429Z 2025-05-07T20:26:07.6984434Z 2025-05-07T20:26:07.6984440Z 2025-05-07T20:26:07.6984663Z  2025-05-07T20:26:07.6984938Z 2025-05-07T20:26:07.6984944Z 2025-05-07T20:26:07.6984950Z 2025-05-07T20:26:07.6984956Z 2025-05-07T20:26:07.6984961Z 2025-05-07T20:26:07.6984967Z 2025-05-07T20:26:07.6984973Z 2025-05-07T20:26:07.6984979Z 2025-05-07T20:26:07.6984984Z 2025-05-07T20:26:07.6984990Z 2025-05-07T20:26:07.6984996Z 2025-05-07T20:26:07.6985181Z  2025-05-07T20:26:07.6985376Z 2025-05-07T20:26:07.6985380Z 2025-05-07T20:26:07.6985385Z 2025-05-07T20:26:07.6985389Z 2025-05-07T20:26:07.6985398Z 2025-05-07T20:26:07.6985403Z 2025-05-07T20:26:07.6985407Z 2025-05-07T20:26:07.6985411Z 2025-05-07T20:26:07.6985415Z 2025-05-07T20:26:07.6985419Z 2025-05-07T20:26:07.6985430Z 2025-05-07T20:26:07.6985434Z 2025-05-07T20:26:07.6985594Z  2025-05-07T20:26:07.6985868Z 2025-05-07T20:26:07.6985874Z 2025-05-07T20:26:07.6985882Z 2025-05-07T20:26:07.6985889Z 2025-05-07T20:26:07.6985895Z 2025-05-07T20:26:07.6985902Z 2025-05-07T20:26:07.6985930Z 2025-05-07T20:26:07.6985935Z 2025-05-07T20:26:07.6985941Z 2025-05-07T20:26:07.6985946Z 2025-05-07T20:26:07.6985951Z 2025-05-07T20:26:07.6985956Z 2025-05-07T20:26:07.6985962Z 2025-05-07T20:26:07.6986176Z  2025-05-07T20:26:07.6986486Z 2025-05-07T20:26:07.6986492Z 2025-05-07T20:26:07.6986497Z 2025-05-07T20:26:07.6986511Z 2025-05-07T20:26:07.6986517Z 2025-05-07T20:26:07.6986523Z 2025-05-07T20:26:07.6986529Z 2025-05-07T20:26:07.6986534Z 2025-05-07T20:26:07.6986540Z 2025-05-07T20:26:07.6986551Z 2025-05-07T20:26:07.6986557Z 2025-05-07T20:26:07.6986563Z 2025-05-07T20:26:07.6986568Z 2025-05-07T20:26:07.6986574Z 2025-05-07T20:26:07.6986810Z  2025-05-07T20:26:07.6987106Z 2025-05-07T20:26:07.6987111Z 2025-05-07T20:26:07.6987117Z 
2025-05-07T20:26:07.6987122Z 2025-05-07T20:26:07.6987128Z 2025-05-07T20:26:07.6987133Z 2025-05-07T20:26:07.6987139Z 2025-05-07T20:26:07.6987144Z 2025-05-07T20:26:07.6987163Z 2025-05-07T20:26:07.6987169Z 2025-05-07T20:26:07.6987175Z 2025-05-07T20:26:07.6987188Z 2025-05-07T20:26:07.6987193Z 2025-05-07T20:26:07.6987199Z 2025-05-07T20:26:07.6987204Z 2025-05-07T20:26:07.6987431Z  2025-05-07T20:26:07.6987734Z 2025-05-07T20:26:07.6987739Z 2025-05-07T20:26:07.6987745Z 2025-05-07T20:26:07.6987925Z 2025-05-07T20:26:07.6987930Z 2025-05-07T20:26:07.6987935Z 2025-05-07T20:26:07.6987940Z 2025-05-07T20:26:07.6987946Z 2025-05-07T20:26:07.6987951Z 2025-05-07T20:26:07.6988042Z 2025-05-07T20:26:07.6988048Z 2025-05-07T20:26:07.6988053Z 2025-05-07T20:26:07.6988059Z 2025-05-07T20:26:07.6988064Z 2025-05-07T20:26:07.6988070Z 2025-05-07T20:26:07.6988075Z 2025-05-07T20:26:07.6988317Z  2025-05-07T20:26:07.6988634Z 2025-05-07T20:26:07.6988640Z 2025-05-07T20:26:07.6988645Z 2025-05-07T20:26:07.6988651Z 2025-05-07T20:26:07.6988656Z 2025-05-07T20:26:07.6988661Z 2025-05-07T20:26:07.6988667Z 2025-05-07T20:26:07.6988672Z 2025-05-07T20:26:07.6988678Z 2025-05-07T20:26:07.6988683Z 2025-05-07T20:26:07.6988689Z 2025-05-07T20:26:07.6988694Z 2025-05-07T20:26:07.6988700Z 2025-05-07T20:26:07.6988705Z 2025-05-07T20:26:07.6988711Z 2025-05-07T20:26:07.6988716Z 2025-05-07T20:26:07.6988722Z 2025-05-07T20:26:07.6988978Z  2025-05-07T20:26:07.6989308Z 2025-05-07T20:26:07.6989314Z 2025-05-07T20:26:07.6989319Z 2025-05-07T20:26:07.6989324Z 2025-05-07T20:26:07.6989330Z 2025-05-07T20:26:07.6989341Z 2025-05-07T20:26:07.6989358Z 2025-05-07T20:26:07.6989364Z 2025-05-07T20:26:07.6989369Z 2025-05-07T20:26:07.6989375Z 2025-05-07T20:26:07.6989380Z 2025-05-07T20:26:07.6989386Z 2025-05-07T20:26:07.6989391Z 2025-05-07T20:26:07.6989397Z 2025-05-07T20:26:07.6989402Z 2025-05-07T20:26:07.6989408Z 2025-05-07T20:26:07.6989413Z 2025-05-07T20:26:07.6989419Z 2025-05-07T20:26:07.6989667Z  2025-05-07T20:26:07.6990001Z 2025-05-07T20:26:07.6990006Z 2025-05-07T20:26:07.6990159Z  2025-05-07T20:26:07.6990319Z 2025-05-07T20:26:07.6990333Z 2025-05-07T20:26:07.6990489Z  2025-05-07T20:26:07.6990651Z 2025-05-07T20:26:07.6990657Z 2025-05-07T20:26:07.6990673Z 2025-05-07T20:26:07.6990842Z  2025-05-07T20:26:07.6991015Z 2025-05-07T20:26:07.6991021Z 2025-05-07T20:26:07.6991034Z 2025-05-07T20:26:07.6991040Z 2025-05-07T20:26:07.6991211Z  2025-05-07T20:26:07.6991392Z 2025-05-07T20:26:07.6991398Z 2025-05-07T20:26:07.6991403Z 2025-05-07T20:26:07.6991415Z 2025-05-07T20:26:07.6991420Z 2025-05-07T20:26:07.6991587Z  2025-05-07T20:26:07.6991785Z 2025-05-07T20:26:07.6991790Z 2025-05-07T20:26:07.6991796Z 2025-05-07T20:26:07.6991801Z 2025-05-07T20:26:07.6991807Z 2025-05-07T20:26:07.6991812Z 2025-05-07T20:26:07.6991984Z  2025-05-07T20:26:07.6992190Z 2025-05-07T20:26:07.6992196Z 2025-05-07T20:26:07.6992201Z 2025-05-07T20:26:07.6992207Z 2025-05-07T20:26:07.6992212Z 2025-05-07T20:26:07.6992218Z 2025-05-07T20:26:07.6992223Z 2025-05-07T20:26:07.6992421Z  2025-05-07T20:26:07.6992637Z 2025-05-07T20:26:07.6992642Z 2025-05-07T20:26:07.6992647Z 2025-05-07T20:26:07.6992652Z 2025-05-07T20:26:07.6992657Z 2025-05-07T20:26:07.6992663Z 2025-05-07T20:26:07.6992668Z 2025-05-07T20:26:07.6992674Z 2025-05-07T20:26:07.6992871Z  2025-05-07T20:26:07.6993102Z 2025-05-07T20:26:07.6993107Z 2025-05-07T20:26:07.6993113Z 2025-05-07T20:26:07.6993118Z 2025-05-07T20:26:07.6993131Z 2025-05-07T20:26:07.6993136Z 2025-05-07T20:26:07.6993142Z 2025-05-07T20:26:07.6993148Z 2025-05-07T20:26:07.6993153Z 2025-05-07T20:26:07.6993352Z  
2025-05-07T20:26:07.6993723Z 2025-05-07T20:26:07.6993728Z 2025-05-07T20:26:07.6993734Z 2025-05-07T20:26:07.6993739Z 2025-05-07T20:26:07.6993745Z 2025-05-07T20:26:07.6993750Z 2025-05-07T20:26:07.6993755Z 2025-05-07T20:26:07.6993761Z 2025-05-07T20:26:07.6993767Z 2025-05-07T20:26:07.6993772Z 2025-05-07T20:26:07.6993990Z  2025-05-07T20:26:07.6994246Z 2025-05-07T20:26:07.6994252Z 2025-05-07T20:26:07.6994257Z 2025-05-07T20:26:07.6994263Z 2025-05-07T20:26:07.6994268Z 2025-05-07T20:26:07.6994274Z 2025-05-07T20:26:07.6994279Z 2025-05-07T20:26:07.6994285Z 2025-05-07T20:26:07.6994290Z 2025-05-07T20:26:07.6994413Z 2025-05-07T20:26:07.6994428Z 2025-05-07T20:26:07.6994656Z  2025-05-07T20:26:07.6994933Z 2025-05-07T20:26:07.6994938Z 2025-05-07T20:26:07.6995036Z 2025-05-07T20:26:07.6995043Z 2025-05-07T20:26:07.6995048Z 2025-05-07T20:26:07.6995054Z 2025-05-07T20:26:07.6995059Z 2025-05-07T20:26:07.6995064Z 2025-05-07T20:26:07.6995070Z 2025-05-07T20:26:07.6995075Z 2025-05-07T20:26:07.6995081Z 2025-05-07T20:26:07.6995086Z 2025-05-07T20:26:07.6995294Z  2025-05-07T20:26:07.6995584Z 2025-05-07T20:26:07.6995589Z 2025-05-07T20:26:07.6995595Z 2025-05-07T20:26:07.6995600Z 2025-05-07T20:26:07.6995606Z 2025-05-07T20:26:07.6995611Z 2025-05-07T20:26:07.6995617Z 2025-05-07T20:26:07.6995622Z 2025-05-07T20:26:07.6995628Z 2025-05-07T20:26:07.6995633Z 2025-05-07T20:26:07.6995639Z 2025-05-07T20:26:07.6995644Z 2025-05-07T20:26:07.6995650Z 2025-05-07T20:26:07.6995866Z  2025-05-07T20:26:07.6996151Z 2025-05-07T20:26:07.6996165Z 2025-05-07T20:26:07.6996171Z 2025-05-07T20:26:07.6996177Z 2025-05-07T20:26:07.6996182Z 2025-05-07T20:26:07.6996187Z 2025-05-07T20:26:07.6996192Z 2025-05-07T20:26:07.6996203Z 2025-05-07T20:26:07.6996209Z 2025-05-07T20:26:07.6996214Z 2025-05-07T20:26:07.6996220Z 2025-05-07T20:26:07.6996225Z 2025-05-07T20:26:07.6996231Z 2025-05-07T20:26:07.6996245Z 2025-05-07T20:26:07.6996462Z  2025-05-07T20:26:07.6996756Z 2025-05-07T20:26:07.6996762Z 2025-05-07T20:26:07.6996768Z 2025-05-07T20:26:07.6996773Z 2025-05-07T20:26:07.6996778Z 2025-05-07T20:26:07.6996784Z 2025-05-07T20:26:07.6996798Z 2025-05-07T20:26:07.6996803Z 2025-05-07T20:26:07.6996809Z 2025-05-07T20:26:07.6996814Z 2025-05-07T20:26:07.6996820Z 2025-05-07T20:26:07.6996825Z 2025-05-07T20:26:07.6996831Z 2025-05-07T20:26:07.6996836Z 2025-05-07T20:26:07.6996842Z 2025-05-07T20:26:07.6997066Z  2025-05-07T20:26:07.6997381Z 2025-05-07T20:26:07.6997394Z 2025-05-07T20:26:07.6997399Z 2025-05-07T20:26:07.6997405Z 2025-05-07T20:26:07.6997410Z 2025-05-07T20:26:07.6997416Z 2025-05-07T20:26:07.6997421Z 2025-05-07T20:26:07.6997432Z 2025-05-07T20:26:07.6997438Z 2025-05-07T20:26:07.6997443Z 2025-05-07T20:26:07.6997449Z 2025-05-07T20:26:07.6997454Z 2025-05-07T20:26:07.6997459Z 2025-05-07T20:26:07.6997465Z 2025-05-07T20:26:07.6997470Z 2025-05-07T20:26:07.6997476Z 2025-05-07T20:26:07.6997714Z  2025-05-07T20:26:07.6998025Z 2025-05-07T20:26:07.6998031Z 2025-05-07T20:26:07.6998035Z 2025-05-07T20:26:07.6998041Z 2025-05-07T20:26:07.6998046Z 2025-05-07T20:26:07.6998052Z 2025-05-07T20:26:07.6998083Z 2025-05-07T20:26:07.6998089Z 2025-05-07T20:26:07.6998094Z 2025-05-07T20:26:07.6998100Z 2025-05-07T20:26:07.6998105Z 2025-05-07T20:26:07.6998111Z 2025-05-07T20:26:07.6998116Z 2025-05-07T20:26:07.6998122Z 2025-05-07T20:26:07.6998127Z 2025-05-07T20:26:07.6998133Z 2025-05-07T20:26:07.6998145Z 2025-05-07T20:26:07.6998389Z  2025-05-07T20:26:07.6998714Z 2025-05-07T20:26:07.6998720Z 2025-05-07T20:26:07.6998725Z 2025-05-07T20:26:07.6998738Z 2025-05-07T20:26:07.6998743Z 2025-05-07T20:26:07.6998749Z 
2025-05-07T20:26:07.6998754Z 2025-05-07T20:26:07.6998760Z 2025-05-07T20:26:07.6998765Z 2025-05-07T20:26:07.6998771Z 2025-05-07T20:26:07.6998776Z 2025-05-07T20:26:07.6998782Z 2025-05-07T20:26:07.6998787Z 2025-05-07T20:26:07.6998793Z 2025-05-07T20:26:07.6998798Z 2025-05-07T20:26:07.6998804Z 2025-05-07T20:26:07.6998810Z 2025-05-07T20:26:07.6998815Z 2025-05-07T20:26:07.6999075Z  2025-05-07T20:26:07.6999397Z 2025-05-07T20:26:07.6999402Z 2025-05-07T20:26:07.6999575Z  2025-05-07T20:26:07.6999737Z 2025-05-07T20:26:07.6999743Z 2025-05-07T20:26:07.6999903Z  2025-05-07T20:26:07.7000074Z 2025-05-07T20:26:07.7000080Z 2025-05-07T20:26:07.7000086Z 2025-05-07T20:26:07.7000247Z  2025-05-07T20:26:07.7000537Z 2025-05-07T20:26:07.7000542Z 2025-05-07T20:26:07.7000557Z 2025-05-07T20:26:07.7000562Z 2025-05-07T20:26:07.7000712Z  2025-05-07T20:26:07.7000932Z 2025-05-07T20:26:07.7000937Z 2025-05-07T20:26:07.7000941Z 2025-05-07T20:26:07.7000945Z 2025-05-07T20:26:07.7000949Z 2025-05-07T20:26:07.7001079Z  2025-05-07T20:26:07.7001217Z 2025-05-07T20:26:07.7001221Z 2025-05-07T20:26:07.7001225Z 2025-05-07T20:26:07.7001229Z 2025-05-07T20:26:07.7001233Z 2025-05-07T20:26:07.7001237Z 2025-05-07T20:26:07.7001367Z  2025-05-07T20:26:07.7001513Z 2025-05-07T20:26:07.7001517Z 2025-05-07T20:26:07.7001521Z 2025-05-07T20:26:07.7001525Z 2025-05-07T20:26:07.7001529Z 2025-05-07T20:26:07.7001533Z 2025-05-07T20:26:07.7001537Z 2025-05-07T20:26:07.7001669Z  2025-05-07T20:26:07.7001827Z 2025-05-07T20:26:07.7001831Z 2025-05-07T20:26:07.7001835Z 2025-05-07T20:26:07.7001839Z 2025-05-07T20:26:07.7001843Z 2025-05-07T20:26:07.7001853Z 2025-05-07T20:26:07.7001857Z 2025-05-07T20:26:07.7001861Z 2025-05-07T20:26:07.7002024Z  2025-05-07T20:26:07.7002192Z 2025-05-07T20:26:07.7002196Z 2025-05-07T20:26:07.7002205Z 2025-05-07T20:26:07.7002209Z 2025-05-07T20:26:07.7002213Z 2025-05-07T20:26:07.7002217Z 2025-05-07T20:26:07.7002221Z 2025-05-07T20:26:07.7002231Z 2025-05-07T20:26:07.7002235Z 2025-05-07T20:26:07.7002369Z  2025-05-07T20:26:07.7002544Z 2025-05-07T20:26:07.7002548Z 2025-05-07T20:26:07.7002552Z 2025-05-07T20:26:07.7002556Z 2025-05-07T20:26:07.7002560Z 2025-05-07T20:26:07.7002564Z 2025-05-07T20:26:07.7002575Z 2025-05-07T20:26:07.7002579Z 2025-05-07T20:26:07.7002583Z 2025-05-07T20:26:07.7002587Z 2025-05-07T20:26:07.7002726Z  2025-05-07T20:26:07.7002907Z 2025-05-07T20:26:07.7002910Z 2025-05-07T20:26:07.7002914Z 2025-05-07T20:26:07.7002925Z 2025-05-07T20:26:07.7002929Z 2025-05-07T20:26:07.7002933Z 2025-05-07T20:26:07.7002937Z 2025-05-07T20:26:07.7002945Z 2025-05-07T20:26:07.7002949Z 2025-05-07T20:26:07.7002953Z 2025-05-07T20:26:07.7002957Z 2025-05-07T20:26:07.7003100Z  2025-05-07T20:26:07.7003310Z 2025-05-07T20:26:07.7003314Z 2025-05-07T20:26:07.7003318Z 2025-05-07T20:26:07.7003322Z 2025-05-07T20:26:07.7003325Z 2025-05-07T20:26:07.7003330Z 2025-05-07T20:26:07.7003334Z 2025-05-07T20:26:07.7003338Z 2025-05-07T20:26:07.7003342Z 2025-05-07T20:26:07.7003345Z 2025-05-07T20:26:07.7003349Z 2025-05-07T20:26:07.7003356Z 2025-05-07T20:26:07.7003570Z  2025-05-07T20:26:07.7003872Z 2025-05-07T20:26:07.7003878Z 2025-05-07T20:26:07.7003883Z 2025-05-07T20:26:07.7003889Z 2025-05-07T20:26:07.7003894Z 2025-05-07T20:26:07.7003900Z 2025-05-07T20:26:07.7003905Z 2025-05-07T20:26:07.7003911Z 2025-05-07T20:26:07.7003916Z 2025-05-07T20:26:07.7003922Z 2025-05-07T20:26:07.7003927Z 2025-05-07T20:26:07.7003933Z 2025-05-07T20:26:07.7003938Z 2025-05-07T20:26:07.7004186Z  done 2025-05-07T20:26:08.0163873Z Preparing transaction: | / - done 2025-05-07T20:26:12.4783168Z Verifying 
transaction: | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:26:13.4115347Z Executing transaction: / - \ | / - \ | / done 2025-05-07T20:26:15.9355129Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ... 2025-05-07T20:26:15.9355586Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:15.9371119Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:15.9371768Z 2025-05-07T20:26:15.9371773Z 2025-05-07T20:26:15.9372477Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:15.9373667Z 2025-05-07T20:26:15.9385661Z 2025-05-07T20:26:15.9385989Z [INSTALL] Copying nvtx3 headers ... 2025-05-07T20:26:15.9392189Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/ 2025-05-07T20:26:15.9396619Z 2025-05-07T20:26:16.1119846Z 2025-05-07T20:26:16.1125684Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/ 2025-05-07T20:26:16.1129906Z 2025-05-07T20:26:16.1149961Z 2025-05-07T20:26:16.1150480Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ... 2025-05-07T20:26:16.1546490Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ... 2025-05-07T20:26:18.0780809Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. 
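[NOTE] The fix-up above reflects that these CUDA conda packages ship the NVTX v3 headers under the nsight-compute directory, while the build presumably still expects the unversioned libnvToolsExt.so name and the headers on the usual include paths. A minimal standalone sketch of the same fix-up, assuming CONDA_PREFIX points at the build_binary environment and exactly one nsight-compute-* directory exists (both assumptions, not part of this log):
# Sketch only: mirrors the ln/cp steps above under the stated assumptions.
CONDA_PREFIX=/home/ec2-user/miniconda/envs/build_binary   # assumption
# Restore the unversioned soname by pointing it at the versioned library.
for libdir in "$CONDA_PREFIX/lib" "$CONDA_PREFIX/targets/x86_64-linux/lib"; do
    ln -sf "$libdir/libnvToolsExt.so.1" "$libdir/libnvToolsExt.so"
done
# Copy the header-only NVTX v3 files next to the rest of the CUDA headers.
nvtx3=$(echo "$CONDA_PREFIX"/nsight-compute-*/host/target-linux-x64/nvtx/include/nvtx3)
cp -r "$nvtx3"/* "$CONDA_PREFIX/include/"
cp -r "$nvtx3"/* "$CONDA_PREFIX/targets/x86_64-linux/include/"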
2025-05-07T20:26:18.1406634Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:18.5770317Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:18.6128135Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:19.0500564Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:19.0501880Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:21.5981107Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:23.7629088Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:25.8688345Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:25.8689191Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:27.9343741Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:29.9547470Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:30.0221075Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:34.0305636Z /tmp/tmpaxofnkh1: line 3: clang: command not found
2025-05-07T20:26:34.0306471Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:34.0944434Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:34.0964568Z total 36
2025-05-07T20:26:34.0965116Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:26 .
2025-05-07T20:26:34.0965688Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:24 ..
2025-05-07T20:26:34.0966269Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:26:34.0966827Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:26:34.0967559Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:26:34.0968218Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:34.0968850Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:34.0969341Z -rw-r--r--. 2 ec2-user ec2-user  2932 Jan 24 22:22 ~cuda-nvcc_activate.sh
2025-05-07T20:26:34.0969867Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
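[NOTE] The step above uses `conda env config vars set`, which persists a variable inside the environment itself so every later `conda run -n build_binary` sees it on activation; the earlier `printenv LD_LIBRARY_PATH` failure is expected when the variable has not been set yet, since printenv exits non-zero for unset names. A minimal sketch of the same pattern (the variable name and value are illustrative):
# Sketch only: persist an env var in the build_binary environment.
conda env config vars set -n build_binary MY_FLAG=/some/path   # MY_FLAG is illustrative
# `conda run` re-activates the environment, so the stored variable is visible:
conda run -n build_binary printenv MY_FLAG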
2025-05-07T20:26:34.0970526Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:34.0988336Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:36.1008112Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:36.1008957Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:36.5624233Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:38.6001790Z -allow-unsupported-compiler
2025-05-07T20:26:38.6649690Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:26:38.6650252Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null
2025-05-07T20:26:40.6852085Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead")))
2025-05-07T20:26:40.6852872Z #define M_PIl 3.141592653589793238462643383279502884L
2025-05-07T20:26:40.6853221Z #define _IO_CURRENTLY_PUTTING 0x800
2025-05-07T20:26:40.6853558Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig))
2025-05-07T20:26:40.6853929Z #define __DBL_MIN_EXP__ (-1021)
2025-05-07T20:26:40.6854360Z #define _STL_PAIR_H 1
2025-05-07T20:26:40.6861932Z #define __cpp_attributes 200809L
2025-05-07T20:26:40.6862487Z #define __cpp_nontype_template_parameter_auto 201606L
2025-05-07T20:26:40.6862999Z #define __DELETE_THROW throw()
2025-05-07T20:26:40.6863387Z #define _PTRDIFF_T_
2025-05-07T20:26:40.6863744Z #define M_PI_4 0.78539816339744830962
2025-05-07T20:26:40.6864111Z #define __UINT_LEAST16_MAX__ 0xffff
2025-05-07T20:26:40.6864441Z #define _IO_LEFT 02
2025-05-07T20:26:40.6864787Z #define __ATOMIC_ACQUIRE 2
2025-05-07T20:26:40.6865178Z #define _POSIX2_BC_SCALE_MAX 99
2025-05-07T20:26:40.6865573Z #define _GLIBCXX_USE_RANDOM_TR1 1
2025-05-07T20:26:40.6866124Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp)
2025-05-07T20:26:40.6866589Z #define __FLT128_MAX_10_EXP__ 4932
2025-05-07T20:26:40.6867017Z #define RE_DUP_MAX (0x7fff)
2025-05-07T20:26:40.6867389Z #define _IOS_OUTPUT 2
2025-05-07T20:26:40.6867734Z #define __SM_100_RT_HPP__
2025-05-07T20:26:40.6868421Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
2025-05-07T20:26:40.6868860Z #define toascii_l(c,l) __toascii_l ((c), (l))
2025-05-07T20:26:40.6869576Z #define __GCC_IEC_559_COMPLEX 2
2025-05-07T20:26:40.6869995Z #define _GLIBCXX_USE_FCHMOD 1
2025-05-07T20:26:40.6870398Z #define __cpp_aggregate_nsdmi 201304L
2025-05-07T20:26:40.6871498Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; }))
2025-05-07T20:26:40.6872808Z #define __UINT_LEAST8_TYPE__ unsigned char
2025-05-07T20:26:40.6873261Z #define __SIZEOF_FLOAT80__ 16
2025-05-07T20:26:40.6873784Z #define cudaTextureTypeCubemapLayered 0xFC
2025-05-07T20:26:40.6874236Z #define _T_WCHAR_
2025-05-07T20:26:40.6874586Z #define stdout stdout
2025-05-07T20:26:40.6875079Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11")))
2025-05-07T20:26:40.6875711Z #define CHAR_BIT __CHAR_BIT__
2025-05-07T20:26:40.6876134Z #define __flexarr [] 2025-05-07T20:26:40.6876533Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:40.6877046Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:40.6877574Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:40.6877960Z #define _MATH_H 1 2025-05-07T20:26:40.6878355Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:40.6878843Z #define __S64_TYPE long int 2025-05-07T20:26:40.6879212Z #define __stub_fchflags 2025-05-07T20:26:40.6879588Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:40.6880019Z #define __SQUAD_TYPE long int 2025-05-07T20:26:40.6880316Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:40.6880633Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:40.6881132Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:40.6881513Z #define NL_NMAX INT_MAX 2025-05-07T20:26:40.6881848Z #define _BITS_TIME_H 1 2025-05-07T20:26:40.6882181Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:40.6882532Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:40.6882849Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:40.6883225Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:40.6883643Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:40.6884027Z #define __CHAR_BIT__ 8 2025-05-07T20:26:40.6884298Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:40.6884631Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:40.6884942Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:40.6885216Z #define FP_NAN 0 2025-05-07T20:26:40.6885496Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:40.6885933Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:40.6886333Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:40.6887449Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 
2025-05-07T20:26:40.6888191Z 2025-05-07T20:26:40.6888297Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:40.6888584Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:40.6888850Z #define __SM_80_RT_H__ 2025-05-07T20:26:40.6889093Z #define _NEW 2025-05-07T20:26:40.6889336Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:40.6889627Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:40.6890017Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:40.6890448Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:40.6890699Z #define __USE_ANSI 1 2025-05-07T20:26:40.6891003Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:40.6891419Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:40.6891799Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:40.6892114Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:40.6892413Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:40.6892900Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:40.6893201Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:40.6893505Z #define PIPE_BUF 4096 2025-05-07T20:26:40.6893930Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:40.6894408Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:40.6894804Z #define ADJ_TICK 0x4000 2025-05-07T20:26:40.6895097Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:40.6895428Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:40.6895711Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:40.6896049Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:40.6896522Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:40.6897063Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:40.6897446Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:40.6897720Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:40.6898012Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:40.6898317Z #define __cpp_static_assert 201411L 2025-05-07T20:26:40.6898619Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:40.6898899Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:40.6899195Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:40.6899494Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:40.6899807Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:40.6900118Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:40.6900437Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:40.6900813Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:40.6901168Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:40.6901468Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:40.6901797Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:40.6902169Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:40.6902547Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:40.6902867Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:40.6903173Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:40.6903521Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:40.6903871Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:40.6904290Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:40.6904722Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:40.6905044Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:40.6905331Z #define 
__GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:40.6905623Z #define __GCC_IEC_559 2 2025-05-07T20:26:40.6905938Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:40.6906295Z #define _IO_flockfile(_fp) 2025-05-07T20:26:40.6906568Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:40.6906857Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:40.6907138Z #define _IOFBF 0 2025-05-07T20:26:40.6907361Z #define __USE_BSD 1 2025-05-07T20:26:40.6907604Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:40.6907899Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:40.6908182Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:40.6908459Z #define _IO_NO_WRITES 8 2025-05-07T20:26:40.6908731Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:40.6909102Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:40.6909475Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:40.6909801Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:40.6910141Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:40.6910445Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:40.6910732Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:40.6911015Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:40.6911338Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:40.6911745Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:40.6912138Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:40.6912456Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:40.6912992Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:40.6913340Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:40.6913964Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:40.6914282Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:40.6914585Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:40.6914870Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:40.6915475Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:40.6916089Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:40.6916432Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:40.6916767Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:40.6917083Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:40.6917373Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:40.6917649Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:40.6917981Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:40.6918324Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:40.6918642Z #define RAND_MAX 2147483647 2025-05-07T20:26:40.6918918Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:40.6919261Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:40.6919594Z #define __SM_90_RT_H__ 2025-05-07T20:26:40.6919848Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:40.6920123Z #define __COMPAR_FN_T 2025-05-07T20:26:40.6920381Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:40.6920654Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:40.6921156Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:40.6921692Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:40.6922048Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:40.6922426Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:40.6922743Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:40.6923107Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:40.6923432Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:40.6924439Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:40.6925015Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:40.6925360Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:40.6925692Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:40.6926003Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:40.6926323Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:40.6926605Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:40.6926882Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:40.6927162Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:40.6927423Z #define __u_char_defined 2025-05-07T20:26:40.6927754Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:40.6928131Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:40.6928403Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:40.6928678Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:40.6928970Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:40.6929438Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:40.6929885Z #define FP_INFINITE 1 2025-05-07T20:26:40.6930272Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:40.6930710Z #define _IO_pid_t __pid_t 2025-05-07T20:26:40.6930980Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:40.6931246Z #define __LEAF , __leaf__ 2025-05-07T20:26:40.6931501Z #define PATH_MAX 4096 2025-05-07T20:26:40.6931767Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:40.6932113Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:40.6932448Z #define _LIMITS_H___ 2025-05-07T20:26:40.6932689Z #define __size_t 2025-05-07T20:26:40.6932924Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:40.6933488Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:40.6934367Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:40.6934861Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:40.6935208Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:40.6935484Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:40.6935862Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:40.6936273Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:40.6936585Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:40.6936931Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:40.6937225Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:40.6937526Z #define __INT8_C(c) c 2025-05-07T20:26:40.6937803Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:40.6938119Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:40.6938391Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:40.6938664Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:40.6938930Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:40.6939218Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:40.6939559Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:40.6939910Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:40.6940191Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:40.6940480Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:40.6940758Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:40.6941082Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:40.6941403Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:40.6941787Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:40.6942184Z #define NFDBITS __NFDBITS 2025-05-07T20:26:40.6942457Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:40.6942764Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:40.6943102Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:40.6943430Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:40.6943702Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:40.6944013Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:40.6944330Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:40.6944668Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:40.6945103Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:40.6945478Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:40.6945781Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:40.6946116Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:40.6946453Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:40.6946795Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:40.6947150Z #define __daddr_t_defined 2025-05-07T20:26:40.6947419Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:40.6947703Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:40.6948042Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:40.6948578Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:40.6949083Z #define _ACRTIMP 2025-05-07T20:26:40.6949320Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:40.6949609Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:40.6949911Z #define _IOS_BIN 128 2025-05-07T20:26:40.6950285Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:40.6950717Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:40.6951003Z #define UNDERFLOW 4 2025-05-07T20:26:40.6951228Z #define NAME_MAX 255 
2025-05-07T20:26:40.6951478Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:40.6951763Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:40.6952052Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:40.6952363Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:40.6952758Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:40.6953157Z #define __ptr_t void * 2025-05-07T20:26:40.6953406Z #define M_E 2.7182818284590452354 2025-05-07T20:26:40.6953973Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:40.6954247Z #define __USE_ISOCXX11 1 2025-05-07T20:26:40.6954527Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:40.6955014Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:40.6955324Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:40.6955615Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:40.6955919Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:40.6956252Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:40.6956520Z #define __linux 1 2025-05-07T20:26:40.6956761Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:40.6957054Z #define cudaDeviceMask 0xff 2025-05-07T20:26:40.6957333Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:40.6957643Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:40.6957938Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:40.6958236Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:40.6958559Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:40.6958883Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:40.6959192Z #define _BITS_TYPES_H 1 2025-05-07T20:26:40.6959498Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:40.6959868Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:40.6960179Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:40.6960478Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:40.6960783Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:40.6961091Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:40.6961897Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:40.6962744Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:40.6963048Z #define __unix 1 2025-05-07T20:26:40.6963273Z #define MATH_ERRNO 1 2025-05-07T20:26:40.6963537Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:40.6963833Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:40.6964107Z #define __SM_100_RT_H__ 2025-05-07T20:26:40.6964372Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:40.6964675Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:40.6964984Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:40.6965267Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:40.6965584Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:40.6966069Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:40.6966548Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:40.6966861Z #define CUDARTAPI_CDECL 2025-05-07T20:26:40.6967133Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:40.6967415Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:40.6967717Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:40.6967995Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:40.6968239Z #define __SIZE_T 2025-05-07T20:26:40.6968503Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:40.6968840Z #define 
_GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:26:40.6969149Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:40.6969426Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:40.6969711Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:40.6969989Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:40.6970391Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:40.6970840Z #define __WAIT_STATUS void * 2025-05-07T20:26:40.6971121Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:40.6971398Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:40.6971683Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:40.6971988Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:40.6972273Z #define __WINT_MIN__ 0U 2025-05-07T20:26:40.6972876Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:40.6973547Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:40.6973864Z #define WUNTRACED 2 2025-05-07T20:26:40.6974206Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:40.6974501Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:40.6974802Z #define NZERO 20 2025-05-07T20:26:40.6975117Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:40.6975416Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:40.6975755Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:40.6976079Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:40.6976355Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:40.6976656Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:40.6976940Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:40.6977239Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:40.6977529Z #define EXIT_FAILURE 1 2025-05-07T20:26:40.6977781Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:40.6978055Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:40.6978340Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:40.6978606Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:40.6978902Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:40.6979266Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:40.6979642Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:40.6979947Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:40.6980212Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:40.6980500Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:40.6980804Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:40.6981126Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:40.6981434Z #define SEEK_DATA 3 2025-05-07T20:26:40.6981672Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:40.6981982Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:40.6982420Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:40.6982823Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:40.6983096Z #define __INT64_C(c) c ## L 2025-05-07T20:26:40.6983380Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:40.6983734Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:40.6984076Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:40.6984370Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:40.6984692Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:40.6985004Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:40.6985276Z #define __INT_WCHAR_T_H 2025-05-07T20:26:40.6985530Z #define WSTOPPED 2 2025-05-07T20:26:40.6985775Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:40.6986079Z #define 
_POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:40.6986346Z #define FP_NORMAL 4 2025-05-07T20:26:40.6986596Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:40.6986898Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:40.6987151Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:40.6987418Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:40.6987735Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:40.6988197Z #define cudaTextureType1D 0x01 2025-05-07T20:26:40.6999330Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:40.6999674Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:40.6999985Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:40.7000300Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:40.7000750Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:40.7001221Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:40.7001509Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:40.7001785Z #define _POSIX_SOURCE 1 2025-05-07T20:26:40.7002058Z #define cudaTextureType2D 0x02 2025-05-07T20:26:40.7002339Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:40.7002617Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:40.7002942Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:40.7003221Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:40.7003548Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:40.7003896Z #define cudaTextureType3D 0x03 2025-05-07T20:26:40.7004184Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:40.7004464Z #define CLOCK_REALTIME 0 2025-05-07T20:26:40.7004727Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:40.7005238Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:40.7005548Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:40.7005926Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:40.7006215Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:40.7006525Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:40.7006811Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:40.7007138Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:40.7007459Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:40.7007750Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:40.7008025Z #define __GLIBC__ 2 2025-05-07T20:26:40.7008264Z #define __END_DECLS } 2025-05-07T20:26:40.7008520Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:40.7008902Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:40.7009302Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:40.7009567Z #define WCONTINUED 8 2025-05-07T20:26:40.7009807Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:40.7010090Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:40.7010369Z #define _ALLOCA_H 1 2025-05-07T20:26:40.7010613Z #define __host__ __location__(host) 2025-05-07T20:26:40.7011067Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:40.7011520Z #define __SLONG32_TYPE int 2025-05-07T20:26:40.7011798Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:40.7012100Z #define _SYS_SELECT_H 1 2025-05-07T20:26:40.7012352Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:40.7012612Z #define _IOS_NOCREATE 32 2025-05-07T20:26:40.7012876Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:40.7013165Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:40.7013473Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:40.7013776Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:40.7014071Z #define __global__ __location__(global) 2025-05-07T20:26:40.7014374Z #define 
__GNU_LIBRARY__ 6 2025-05-07T20:26:40.7014646Z #define __cpp_decltype_auto 201304L 2025-05-07T20:26:40.7014940Z #define __DBL_DIG__ 15 2025-05-07T20:26:40.7015176Z #define TIME_UTC 1 2025-05-07T20:26:40.7015404Z #define __FLT32_DIG__ 6 2025-05-07T20:26:40.7015752Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:40.7016161Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:40.7016495Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:40.7016824Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:40.7017133Z #define _G_BUFSIZ 8192 2025-05-07T20:26:40.7017451Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:40.7017839Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:40.7018143Z #define __cudaCDP2GetDevice 2025-05-07T20:26:40.7018439Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:40.7018744Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:40.7019001Z #define __GXX_WEAK__ 1 2025-05-07T20:26:40.7019269Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:40.7019597Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:40.7019874Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:40.7020183Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:40.7020544Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:40.7020839Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:40.7021135Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:40.7021448Z #define _G_config_h 1 2025-05-07T20:26:40.7021743Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:40.7022091Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:40.7022387Z #define _GCC_WCHAR_T 2025-05-07T20:26:40.7022633Z #define TMP_MAX 238328 2025-05-07T20:26:40.7022881Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:40.7023165Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:40.7023442Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:40.7023726Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:40.7024628Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:40.7024932Z #define _IO_SKIPWS 01 2025-05-07T20:26:40.7025655Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:40.7026179Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:40.7026608Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:40.7026967Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:40.7027346Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:40.7027731Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:40.7028115Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:40.7028378Z #define le32toh(x) (x) 2025-05-07T20:26:40.7028627Z #define _SIZE_T_DEFINED 2025-05-07T20:26:40.7028896Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:40.7029245Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:40.7029613Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:40.7030028Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:40.7030462Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:40.7030747Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:40.7031027Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:40.7031307Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:40.7031601Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:40.7032153Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:40.7032683Z #define 
_GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:40.7033005Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:40.7033370Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:40.7033786Z #define _WCHAR_T_ 2025-05-07T20:26:40.7034029Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:40.7034408Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:40.7034813Z #define RTSIG_MAX 32 2025-05-07T20:26:40.7035051Z #define _STDDEF_H 2025-05-07T20:26:40.7035291Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:40.7035604Z #define _VA_LIST_DEFINED 2025-05-07T20:26:40.7035899Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:40.7036247Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:40.7036667Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:40.7037019Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:40.7037322Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:40.7037804Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:40.7038358Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:40.7038748Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:40.7039084Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:40.7039417Z #define __unix__ 1 2025-05-07T20:26:40.7039668Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:40.7039960Z #define __INT_WIDTH__ 32 2025-05-07T20:26:40.7040223Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:40.7040478Z #define _IONBF 2 2025-05-07T20:26:40.7040937Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:40.7041746Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:40.7042304Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:40.7042575Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:40.7042857Z #define __UINT16_C(c) c 2025-05-07T20:26:40.7043119Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:40.7043407Z #define STA_DEL 0x0020 2025-05-07T20:26:40.7043658Z #define __CUDACC_VER_MINOR__ 8 2025-05-07T20:26:40.7043934Z #define __id_t_defined 2025-05-07T20:26:40.7044222Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:40.7044691Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:40.7045140Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:40.7045422Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:40.7045692Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:40.7046105Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:40.7046387Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:40.7046754Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:40.7047038Z #define SING 2 2025-05-07T20:26:40.7047271Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:40.7047556Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:40.7047872Z #define cudaStreamDefault 0x00 2025-05-07T20:26:40.7048240Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:40.7048634Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:40.7048919Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:40.7049205Z #define __gnu_linux__ 1 2025-05-07T20:26:40.7049458Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:40.7049725Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:40.7050046Z #define MAX_INPUT 255 2025-05-07T20:26:40.7050310Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:40.7050650Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:40.7051050Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:40.7051385Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:40.7051674Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:40.7052087Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:40.7052542Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:40.7052892Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:40.7053266Z #define _Mfloat_ float 2025-05-07T20:26:40.7053544Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:40.7053873Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:40.7054170Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:40.7054514Z #define cudaMemPoolCreateUsageHwDecompress 0x2 2025-05-07T20:26:40.7055083Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:40.7055599Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:40.7055893Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:40.7056239Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:40.7056619Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:40.7056928Z #define __USE_ISOC11 1 2025-05-07T20:26:40.7057175Z #define _BSD_SIZE_T_ 2025-05-07T20:26:40.7057425Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:40.7057708Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:40.7057990Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:40.7058303Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:40.7058646Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:40.7058976Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:40.7059318Z #define __THROW throw () 2025-05-07T20:26:40.7059589Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:40.7059899Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:40.7060267Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:40.7060641Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:40.7060942Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:40.7061225Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:40.7061509Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:40.7061790Z #define L_tmpnam 20 2025-05-07T20:26:40.7062027Z #define ___int_wchar_t_h 2025-05-07T20:26:40.7062385Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:40.7062787Z #define isascii(c) __isascii (c) 2025-05-07T20:26:40.7063064Z #define _T_PTRDIFF 2025-05-07T20:26:40.7063382Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:40.7063758Z #define toascii(c) __toascii (c) 2025-05-07T20:26:40.7064030Z #define __GNUC__ 11 2025-05-07T20:26:40.7064290Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:40.7064606Z #define __GXX_RTTI 1 2025-05-07T20:26:40.7064842Z #define __pie__ 2 2025-05-07T20:26:40.7065059Z #define __MMX__ 1 2025-05-07T20:26:40.7065295Z #define __cudaCDP2Malloc 2025-05-07T20:26:40.7065566Z #define __timespec_defined 1 2025-05-07T20:26:40.7065963Z #define L_ctermid 9 2025-05-07T20:26:40.7066203Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:40.7066654Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:40.7067075Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:40.7067461Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:40.7067748Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:40.7068054Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:40.7068373Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:40.7068707Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:40.7068986Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:40.7069446Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:40.7070223Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:40.7070854Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:40.7071183Z #define __USE_SVID 1 2025-05-07T20:26:40.7071447Z #define __constant__ __location__(constant) 2025-05-07T20:26:40.7071784Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:40.7072099Z #define __device__ __location__(device) 2025-05-07T20:26:40.7072437Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:40.7072778Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:40.7073060Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:40.7073351Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:40.7073876Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:40.7074266Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:40.7074566Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:40.7074952Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:40.7075351Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:40.7075618Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:40.7076000Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:40.7076457Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:40.7076792Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:40.7077080Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:40.7077363Z #define NGROUPS_MAX 65536 2025-05-07T20:26:40.7077638Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:40.7077911Z #define __USE_ISOC95 1 2025-05-07T20:26:40.7078153Z #define _TIME_H 1 2025-05-07T20:26:40.7078438Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:40.7078769Z #define __USE_ISOC99 1 2025-05-07T20:26:40.7079110Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:40.7079499Z #define HOST_NAME_MAX 64 2025-05-07T20:26:40.7079770Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:40.7080040Z #define _IOS_ATEND 4 2025-05-07T20:26:40.7080290Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:40.7080633Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:40.7081057Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:40.7081420Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:40.7081725Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:40.7082060Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:40.7082395Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:40.7082667Z #define _STDIO_H 1 2025-05-07T20:26:40.7083075Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:40.7083568Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:40.7083948Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:40.7084342Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:40.7084644Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:40.7084930Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:40.7085218Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:40.7085521Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:40.7085843Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7086303Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:40.7086586Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:40.7086961Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:40.7087284Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:40.7087569Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:40.7087877Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:40.7088250Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:40.7088636Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:40.7088893Z #define __USE_XOPEN 1 2025-05-07T20:26:40.7089153Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:40.7089616Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:40.7090072Z #define __USE_XOPEN2K 1 2025-05-07T20:26:40.7090335Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:40.7090622Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:40.7090929Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:40.7091230Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:40.7091784Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:40.7092329Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:40.7092631Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:40.7093011Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:40.7093421Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:40.7093815Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:40.7094229Z #define __END_NAMESPACE_C99 2025-05-07T20:26:40.7094520Z #define __glibcxx_integral_traps true 2025-05-07T20:26:40.7094820Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:40.7095094Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:40.7095373Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:40.7095648Z #define _IOS_TRUNC 16 2025-05-07T20:26:40.7095892Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:40.7096160Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:40.7096468Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:40.7096783Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:40.7097177Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:40.7097578Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:40.7097865Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:40.7098144Z #define _IO_UNITBUF 020000 2025-05-07T20:26:40.7098418Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:40.7098693Z #define __FD_SETSIZE 1024 2025-05-07T20:26:40.7098962Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:40.7099251Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:40.7099604Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:40.7099979Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:40.7100263Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:40.7100584Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:40.7100925Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:40.7101221Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:40.7101537Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:40.7101899Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:40.7102203Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:40.7102546Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:40.7102845Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:40.7103134Z #define __USE_POSIX199506 1 2025-05-07T20:26:40.7103400Z #define _FEATURES_H 1 2025-05-07T20:26:40.7103652Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:40.7104069Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:40.7104571Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:40.7104910Z #define 
__stub_getmsg 2025-05-07T20:26:40.7105163Z #define _IO_FIXED 010000 2025-05-07T20:26:40.7105456Z #define __cpp_lib_addressof_constexpr 201603 2025-05-07T20:26:40.7105782Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:40.7106196Z #define __stub_setlogin 2025-05-07T20:26:40.7106454Z #define __stub_fattach 2025-05-07T20:26:40.7106710Z #define __cplusplus 201703L 2025-05-07T20:26:40.7107067Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:40.7107375Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:40.7107648Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:40.7107941Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:40.7108448Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:40.7109001Z #define _IO_INTERNAL 010 2025-05-07T20:26:40.7109260Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:40.7109618Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:40.7109990Z #define __dev_t_defined 2025-05-07T20:26:40.7110242Z #define __DEPRECATED 1 2025-05-07T20:26:40.7110491Z #define __S32_TYPE int 2025-05-07T20:26:40.7110763Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:40.7111070Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:40.7111349Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:40.7111627Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:40.7112264Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:40.7112916Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:40.7113246Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:40.7113740Z #define OVERFLOW 3 2025-05-07T20:26:40.7113998Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:40.7114324Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:40.7114627Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:40.7114979Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:40.7115329Z #define __SSE2_MATH__ 1 2025-05-07T20:26:40.7115590Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:40.7115912Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:40.7116232Z #define _IO_STDIO_H 2025-05-07T20:26:40.7116499Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:40.7116815Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:40.7117147Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:40.7117470Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7117797Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:40.7118075Z #define __amd64 1 2025-05-07T20:26:40.7118313Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:40.7118598Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:40.7118887Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:40.7119194Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:40.7119522Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:40.7119802Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:40.7120115Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:40.7120395Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:40.7120657Z #define __bounded 2025-05-07T20:26:40.7120908Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:40.7121313Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:40.7121727Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:40.7122423Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:40.7122763Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:40.7123168Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:40.7123674Z #define __W_STOPCODE(sig) ((sig) 
<< 8 | 0x7f) 2025-05-07T20:26:40.7124646Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:40.7125221Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:40.7125689Z #define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:40.7126177Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:40.7126592Z #define STA_PLL 0x0001 2025-05-07T20:26:40.7127016Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:40.7127418Z #define __GNUG__ 11 2025-05-07T20:26:40.7136509Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:40.7136845Z #define _T_WCHAR 2025-05-07T20:26:40.7137107Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:40.7137429Z #define __specialization_static 2025-05-07T20:26:40.7137753Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:40.7138429Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:40.7138715Z #define cudaArraySparse 0x40 2025-05-07T20:26:40.7139129Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:40.7139436Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:40.7139757Z #define _WCHAR_T 2025-05-07T20:26:40.7139990Z #define __cudaCDP2Free 2025-05-07T20:26:40.7140654Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:40.7141362Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:40.7141805Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:40.7142267Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:40.7142568Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:40.7142852Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:40.7143208Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:40.7143580Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:40.7143841Z #define __NO_CTYPE 1 2025-05-07T20:26:40.7144085Z #define __stub_bdflush 2025-05-07T20:26:40.7144479Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:40.7144928Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:40.7145254Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:40.7145536Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:40.7145833Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:40.7146160Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:40.7146469Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:40.7146831Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:40.7147201Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:40.7147495Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:40.7147795Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:40.7148162Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:40.7148519Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:40.7148812Z #define _IO_STDIO 040000 2025-05-07T20:26:40.7149164Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:40.7149567Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:40.7149905Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:40.7150206Z #define _PTRDIFF_T 2025-05-07T20:26:40.7150438Z #define _MOVE_H 1 2025-05-07T20:26:40.7150680Z #define __cpp_hex_float 201603L 2025-05-07T20:26:40.7150951Z #define ADJ_TAI 0x0080 2025-05-07T20:26:40.7151197Z #define __ptrvalue 2025-05-07T20:26:40.7151434Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:40.7151693Z 
#define __GXX_ABI_VERSION 1016 2025-05-07T20:26:40.7151993Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:40.7152310Z #define MATH_ERREXCEPT 2 2025-05-07T20:26:40.7152572Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:40.7152875Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:40.7153296Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:40.7153833Z #define __USE_GNU 1 2025-05-07T20:26:40.7154082Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:40.7154377Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:40.7154661Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:40.7155063Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:40.7155469Z #define WEXITED 4 2025-05-07T20:26:40.7155699Z #define _IO_NO_READS 4 2025-05-07T20:26:40.7156012Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:40.7156402Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:40.7156722Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:40.7157033Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:40.7157363Z #define __uid_t_defined 2025-05-07T20:26:40.7157631Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:40.7157929Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:40.7158358Z #define WNOHANG 1 2025-05-07T20:26:40.7158618Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:40.7159028Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:40.7159318Z #define cudaEventDefault 0x00 2025-05-07T20:26:40.7159636Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:40.7159972Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:40.7160222Z #define __x86_64 1 2025-05-07T20:26:40.7160471Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:40.7160888Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:40.7161383Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:40.7161904Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:40.7162357Z #define __PTRDIFF_T 2025-05-07T20:26:40.7162696Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:40.7163094Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:40.7163396Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:40.7163706Z #define _Mlong_double_ long double 2025-05-07T20:26:40.7164006Z #define __cpp_lambdas 200907L 2025-05-07T20:26:40.7164276Z #define _IO_DEC 020 2025-05-07T20:26:40.7164517Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:40.7164800Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:40.7165107Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:40.7165406Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:40.7165678Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:40.7165992Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:40.7166364Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:40.7166671Z #define _ANSI_STDDEF_H 2025-05-07T20:26:40.7166954Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:40.7167291Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:40.7167670Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:40.7168076Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:40.7168380Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:40.7168692Z #define __cpp_template_auto 201606L 2025-05-07T20:26:40.7169067Z #define __DBL_MIN__ 
double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:40.7169456Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:40.7169742Z #define __key_t_defined 2025-05-07T20:26:40.7170003Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:40.7170392Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:40.7170884Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:40.7171268Z #define __GNUC_VA_LIST 2025-05-07T20:26:40.7171619Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:40.7172030Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:40.7172310Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:40.7172601Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:40.7172911Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:40.7173183Z #define __WCOREFLAG 0x80 2025-05-07T20:26:40.7173449Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:40.7173778Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:40.7174075Z #define __LP64__ 1 2025-05-07T20:26:40.7174331Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:40.7174665Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:40.7174969Z #define _IO_off64_t __off64_t 2025-05-07T20:26:40.7175240Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:40.7175518Z #define __time_t_defined 1 2025-05-07T20:26:40.7175787Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:40.7176148Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:40.7176532Z #define __USE_UNIX98 1 2025-05-07T20:26:40.7176790Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:40.7177080Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:40.7177360Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:40.7177680Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:40.7178131Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:40.7178401Z #define SEEK_CUR 1 2025-05-07T20:26:40.7178768Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:40.7179058Z #define _ASSERT_H 1 2025-05-07T20:26:40.7179648Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:40.7180306Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:40.7180601Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:40.7180867Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:40.7181153Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:40.7181446Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:40.7181841Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:40.7182267Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:40.7182953Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:40.7183639Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:40.7183950Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:40.7184324Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:40.7184722Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:40.7185010Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:40.7185305Z #define cudaArrayDefault 0x00 2025-05-07T20:26:40.7185603Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:40.7185914Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:40.7186238Z #define TLOSS 5 2025-05-07T20:26:40.7186503Z #define __ssize_t_defined 2025-05-07T20:26:40.7186793Z #define __CUDACC_VER_BUILD__ 61 2025-05-07T20:26:40.7187082Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:40.7187397Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:40.7187696Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:40.7187999Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:40.7188315Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:40.7188644Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:40.7188956Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:40.7189269Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:40.7189576Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:40.7189852Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:40.7190216Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:40.7190600Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:40.7190856Z #define __cdecl 2025-05-07T20:26:40.7191104Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:40.7191453Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:40.7191802Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:40.7192071Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:40.7192360Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:40.7192679Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:40.7192968Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:40.7193302Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:40.7193788Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:40.7194224Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:40.7194682Z #define ADJ_NANO 0x2000 2025-05-07T20:26:40.7195012Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:40.7195409Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:40.7195721Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:40.7196020Z #define __FLT_DIG__ 6 2025-05-07T20:26:40.7196398Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:40.7196818Z #define __NO_INLINE__ 1 2025-05-07T20:26:40.7197141Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:40.7197508Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:40.7197786Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:40.7198071Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:40.7198471Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:40.7198764Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:40.7199161Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:40.7199467Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:40.7199876Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:26:40.7200315Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:40.7200683Z 
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:40.7201043Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:40.7201308Z #define MAX_CANON 255 2025-05-07T20:26:40.7201560Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:40.7201826Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:40.7202111Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:40.7202414Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:40.7202736Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:40.7203055Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:40.7203358Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:40.7203693Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:40.7204029Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:40.7204314Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:40.7204626Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:40.7204929Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:40.7205229Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:40.7205566Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:40.7205872Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:40.7206153Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:40.7206425Z #define _SYS_TYPES_H 1 2025-05-07T20:26:40.7206678Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:40.7206957Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:40.7207226Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:40.7207472Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:40.7207765Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:40.7208083Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:40.7208350Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:40.7208663Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:40.7208956Z #define FP_SUBNORMAL 3 2025-05-07T20:26:40.7209220Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:40.7209519Z #define _INITIALIZER_LIST 2025-05-07T20:26:40.7209784Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:40.7210066Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:40.7210368Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:40.7210646Z #define _IO_file_flags _flags 2025-05-07T20:26:40.7210926Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:40.7211188Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:40.7211489Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:40.7211788Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:40.7212069Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:40.7212477Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:40.7212895Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:40.7213221Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:40.7213511Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:40.7213793Z #define _BSD_SOURCE 1 2025-05-07T20:26:40.7214041Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:40.7214917Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:26:40.7215808Z #define __catch(X) catch(X) 2025-05-07T20:26:40.7216087Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:40.7216393Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:40.7216687Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:40.7216958Z #define __STRING(x) #x 2025-05-07T20:26:40.7217211Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:40.7217499Z #define _T_PTRDIFF_ 2025-05-07T20:26:40.7217763Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:40.7218184Z
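Aside: the _GLIBCXX_HAS_NESTED_TYPE expansion just above is libstdc++'s member-type detection idiom. A minimal standalone sketch of the same pattern, written against a hypothetical nested type value_type (names here are illustrative, not from this build):

#include <vector>
#include <type_traits>

// Same shape as libstdc++'s __void_t helper.
template <typename...> using void_t = void;

// Primary template: assume the nested type is absent.
template <typename T, typename = void_t<>>
struct has_value_type : std::false_type { };

// Partial specialization: chosen only when T::value_type is well-formed.
template <typename T>
struct has_value_type<T, void_t<typename T::value_type>> : std::true_type { };

static_assert(has_value_type<std::vector<int>>::value, "vector has value_type");
static_assert(!has_value_type<int>::value, "int does not");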
#define cudaEventWaitExternal 0x01 2025-05-07T20:26:40.7218480Z #define __unbounded 2025-05-07T20:26:40.7218740Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:40.7219127Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:40.7219421Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:40.7219741Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:40.7220035Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:40.7220343Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:40.7220689Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:40.7221016Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:40.7221311Z #define __managed__ __location__(managed) 2025-05-07T20:26:40.7221633Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:40.7222054Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:40.7222492Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:40.7222770Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:40.7223172Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:40.7223597Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:40.7224222Z #define _SYS_SIZE_T_H 2025-05-07T20:26:40.7224620Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:40.7224981Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:40.7225277Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:40.7225589Z #define _CRTIMP 2025-05-07T20:26:40.7225828Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:40.7226162Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:40.7226547Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:40.7226950Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:40.7227382Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:40.7227725Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:40.7228028Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:40.7228338Z #define __SIZE_T__ 2025-05-07T20:26:40.7228571Z #define __stub_gtty 2025-05-07T20:26:40.7228820Z #define __pid_t_defined 2025-05-07T20:26:40.7229101Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:40.7229426Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:40.7229765Z #define __glibcxx_function_requires(...) 
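Aside: __GNUC_PREREQ, defined just above, is glibc's compiler-version gate. A hedged usage sketch (the MY_COLD name is hypothetical, not from this build):

#include <features.h>   // provides __GNUC_PREREQ

// Enable a GCC attribute only when the compiler is new enough to have it
// (__cold__ appeared in GCC 4.3).
#if __GNUC_PREREQ (4, 3)
# define MY_COLD __attribute__ ((__cold__))
#else
# define MY_COLD
#endif

MY_COLD void report_fatal_error();   // hint: rarely-executed path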
2025-05-07T20:26:40.7230082Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:40.7230343Z #define __need_clockid_t 2025-05-07T20:26:40.7230611Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:40.7230889Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:40.7231225Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:40.7231565Z #define _IO_HEX 0100 2025-05-07T20:26:40.7231847Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:40.7232200Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:40.7232312Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:40.7232420Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:40.7232659Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:40.7232793Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:40.7232906Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:40.7233025Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:40.7233138Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:40.7233248Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:40.7233344Z #define __stub_sstk 2025-05-07T20:26:40.7233447Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:40.7233675Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:40.7233774Z #define __wur 2025-05-07T20:26:40.7233900Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:40.7233999Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:40.7234089Z #define _IO_OCT 040 2025-05-07T20:26:40.7234192Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:40.7234295Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:40.7234394Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:40.7234530Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:40.7234892Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:40.7235006Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:40.7235326Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:40.7235436Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:40.7235534Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:40.7235653Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:40.7235755Z #define __off64_t_defined 2025-05-07T20:26:40.7235863Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:40.7235962Z #define __FLT128_DIG__ 33 2025-05-07T20:26:40.7236075Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:40.7236180Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:40.7236276Z #define __INT32_C(c) c 2025-05-07T20:26:40.7236379Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:40.7236485Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:40.7236595Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:40.7236694Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:40.7236791Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:40.7236906Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:40.7237046Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:40.7237154Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:40.7237259Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:40.7237365Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:40.7237474Z #define __have_pthread_attr_t 1 2025-05-07T20:26:40.7237580Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:40.7237813Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:40.7237936Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:40.7238046Z #define __cudaCDP2EventRecord 2025-05-07T20:26:40.7238148Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:40.7238247Z #define 
htole32(x) (x) 2025-05-07T20:26:40.7238510Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:40.7238641Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:40.7238754Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:40.7238927Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:40.7239087Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:40.7239221Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:40.7239368Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:40.7239473Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:40.7239580Z #define cudaArrayLayered 0x01 2025-05-07T20:26:40.7239759Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:40.7239883Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:40.7239984Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:40.7240090Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:40.7240186Z #define unix 1 2025-05-07T20:26:40.7240286Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:40.7240386Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:40.7240498Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:40.7240624Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:40.7240732Z #define __USE_POSIX 1 2025-05-07T20:26:40.7240835Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:40.7240981Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:40.7241085Z #define __THROWNL throw () 2025-05-07T20:26:40.7241185Z #define __cpp_rtti 199711L 2025-05-07T20:26:40.7241300Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:40.7241404Z #define __PMT(args) args 2025-05-07T20:26:40.7241526Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:40.7241684Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:40.7241812Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:40.7241915Z #define _SIZE_T_DECLARED 2025-05-07T20:26:40.7242024Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:40.7242123Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:40.7242536Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:40.7242791Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:40.7242892Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:40.7242996Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:40.7243235Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:40.7243328Z #define _WCHAR_T_H 2025-05-07T20:26:40.7243424Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:40.7243529Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:40.7243626Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:40.7243732Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:40.7243839Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:40.7243934Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:40.7244056Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:40.7244145Z #define __ELF__ 1 2025-05-07T20:26:40.7244252Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:40.7244365Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:40.7244458Z #define STA_INS 0x0010 2025-05-07T20:26:40.7244563Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:40.7244757Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:40.7244858Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:40.7244966Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:40.7245094Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
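Aside: htole32 at the start of this line expands to the identity because this is a little-endian target (the dump also shows __BYTE_ORDER __LITTLE_ENDIAN), while the htobe* variants byte-swap. A minimal sketch of typical use of these <endian.h> macros:

#include <endian.h>
#include <cstdint>
#include <cstdio>

int main() {
  std::uint32_t host = 0x11223344u;
  std::uint32_t wire = htole32(host);   // no-op here; swaps on big-endian hosts
  std::printf("host %08x wire %08x back %08x\n", host, wire, le32toh(wire));
}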
2025-05-07T20:26:40.7245210Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:40.7245320Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:40.7245432Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:40.7245535Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:40.7245708Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:40.7245876Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:40.7245982Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:40.7246324Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:40.7246461Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:40.7246563Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:40.7246669Z #define __FLT_RADIX__ 2 2025-05-07T20:26:40.7246778Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:40.7246965Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:40.7247067Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:40.7247168Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:40.7247285Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:40.7247389Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:40.7247493Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:40.7247610Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:40.7247701Z #define WORD_BIT 32 2025-05-07T20:26:40.7247794Z #define _IO_USER_BUF 1 2025-05-07T20:26:40.7247902Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:40.7248016Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:40.7248134Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:40.7248245Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:40.7248352Z #define __long_double_t long double 2025-05-07T20:26:40.7248465Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:40.7248563Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:40.7248986Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:40.7249082Z #define __k8 1 2025-05-07T20:26:40.7249289Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:40.7249469Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:40.7249602Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:40.7249710Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:40.7249816Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:40.7249931Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:40.7250032Z #define __blksize_t_defined 2025-05-07T20:26:40.7250138Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:40.7250243Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:40.7250365Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:40.7250471Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:40.7250762Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:40.7250863Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:40.7251046Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:40.7251313Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:40.7251671Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:40.7251786Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:40.7251891Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:40.7251986Z #define SEEK_SET 0 2025-05-07T20:26:40.7252092Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:40.7252196Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:26:40.7252424Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:40.7252536Z #define __cudaCDP2GetLastError 2025-05-07T20:26:40.7252638Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:40.7252742Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:40.7253087Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:40.7253198Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:40.7253311Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:40.7253408Z #define __stub_sigreturn 2025-05-07T20:26:40.7253664Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:40.7253768Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:40.7253865Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:40.7253980Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:40.7254072Z #define CLOCK_TAI 11 2025-05-07T20:26:40.7254187Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:40.7254416Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:40.7254512Z #define __restrict_arr 2025-05-07T20:26:40.7254632Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:40.7254796Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:40.7255345Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:40.7255546Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:40.7255639Z #define __USE_MISC 1 2025-05-07T20:26:40.7255751Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:40.7255865Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:40.7255961Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:40.7256054Z #define __LDBL_DIG__ 18 2025-05-07T20:26:40.7256170Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:40.7256279Z #define __malloc_and_calloc_defined 2025-05-07T20:26:40.7256379Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:40.7256496Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:40.7256586Z #define __x86_64__ 1 2025-05-07T20:26:40.7257001Z #define _SIZE_T_ 2025-05-07T20:26:40.7258004Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:40.7258177Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:40.7266889Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:40.7267054Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:40.7267197Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:40.7267306Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:40.7267427Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:40.7267570Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:40.7267722Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:40.7267830Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:40.7268582Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:26:40.7268717Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:26:40.7268882Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:26:40.7268993Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:26:40.7269101Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:26:40.7269205Z #define STA_FLL 0x0008 2025-05-07T20:26:40.7269360Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:26:40.7269465Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:26:40.7269603Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7269724Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:26:40.7269819Z #define __stub_revoke 2025-05-07T20:26:40.7269926Z #define __timer_t_defined 1 2025-05-07T20:26:40.7270068Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:26:40.7270182Z #define INT_MAX __INT_MAX__ 2025-05-07T20:26:40.7270298Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:26:40.7270418Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:26:40.7270534Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:26:40.7270650Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:26:40.7270769Z #define cudaArrayTextureGather 0x08 2025-05-07T20:26:40.7270888Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:26:40.7271047Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:26:40.7271151Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:26:40.7271260Z #define _IO_off_t __off_t 2025-05-07T20:26:40.7271355Z #define __FLT64_DIG__ 15 2025-05-07T20:26:40.7271590Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:26:40.7271703Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:26:40.7271839Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:40.7271979Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:26:40.7272089Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:26:40.7272200Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:26:40.7272301Z #define NULL __null 2025-05-07T20:26:40.7272451Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:26:40.7272566Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:26:40.7272682Z #define __U64_TYPE unsigned long int 2025-05-07T20:26:40.7272785Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:26:40.7272886Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:26:40.7272984Z #define FP_ZERO 2 2025-05-07T20:26:40.7273089Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:26:40.7273259Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:26:40.7273376Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7273469Z #define __WCHAR_T__ 2025-05-07T20:26:40.7273751Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:26:40.7273960Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:26:40.7274122Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:26:40.7274244Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:26:40.7274380Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:26:40.7274503Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:40.7274646Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:26:40.7274781Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:26:40.7274886Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:26:40.7274985Z #define _SIGSET_H_types 1 2025-05-07T20:26:40.7275107Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:26:40.7275228Z #define __cpp_unicode_literals 200710L 2025-05-07T20:26:40.7275385Z 
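Aside: the strdupa expansion completed just above copies a string via __builtin_alloca, so the duplicate lives in the caller's stack frame and is released on return. A hedged usage sketch (strdupa is a GNU extension; __USE_GNU is set in this build):

#include <string.h>   // strdupa
#include <cstdio>

static void shout(const char *word) {
  char *copy = strdupa(word);   // stack copy, freed automatically on return
  copy[0] = 'J';                // safe: we own the copy, not the original
  std::printf("%s\n", copy);
}

int main() { shout("hello"); }  // prints "Jello"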
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:26:40.7275498Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:26:40.7275630Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:26:40.7275773Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:26:40.7275895Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:26:40.7276127Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:26:40.7276248Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:26:40.7276516Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:26:40.7276622Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:26:40.7276742Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:26:40.7276847Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:26:40.7276942Z #define STA_MODE 0x4000 2025-05-07T20:26:40.7277065Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:26:40.7277176Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:26:40.7277299Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:26:40.7277413Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:26:40.7277516Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:26:40.7277629Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:26:40.7277736Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:26:40.7277856Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:26:40.7277960Z #define __SIZE_WIDTH__ 64 2025-05-07T20:26:40.7278091Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:40.7278186Z #define __SEG_FS 1 2025-05-07T20:26:40.7278288Z #define _IO_size_t size_t 2025-05-07T20:26:40.7278391Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:26:40.7278498Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:26:40.7278594Z #define __stub_lchmod 2025-05-07T20:26:40.7278693Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:26:40.7278807Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7278918Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:26:40.7279006Z #define __SEG_GS 1 2025-05-07T20:26:40.7279200Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:26:40.7279300Z #define _IOS_APPEND 8 2025-05-07T20:26:40.7279402Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:26:40.7279501Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:26:40.7279614Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:26:40.7279725Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:26:40.7279839Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:26:40.7279932Z #define htole16(x) (x) 2025-05-07T20:26:40.7280054Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:40.7280162Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:26:40.7280263Z #define __INT16_TYPE__ short int 2025-05-07T20:26:40.7280374Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:26:40.7280495Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:26:40.7280612Z #define __cpp_structured_bindings 201606L 2025-05-07T20:26:40.7280744Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:26:40.7280850Z #define __SIZEOF_INT__ 4 2025-05-07T20:26:40.7280947Z #define __WCLONE 0x80000000 2025-05-07T20:26:40.7281047Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:26:40.7281142Z #define SEEK_HOLE 4 2025-05-07T20:26:40.7281237Z #define TIMER_ABSTIME 1 2025-05-07T20:26:40.7281349Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:26:40.7281447Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:26:40.7281636Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:40.7281762Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7281871Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:26:40.7281987Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:26:40.7282098Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:26:40.7282228Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:26:40.7282325Z #define _LINUX_LIMITS_H 2025-05-07T20:26:40.7282419Z #define linux 1 2025-05-07T20:26:40.7282517Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:26:40.7282641Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:26:40.7282748Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:26:40.7282849Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:26:40.7282969Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:26:40.7283124Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:26:40.7283229Z #define __cpp_lib_hypot 201603 2025-05-07T20:26:40.7283430Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:26:40.7283537Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:26:40.7283633Z #define MOD_NANO ADJ_NANO 2025-05-07T20:26:40.7283807Z #define htole64(x) (x) 2025-05-07T20:26:40.7283917Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:26:40.7284051Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:26:40.7284159Z #define _IO_UPPERCASE 01000 2025-05-07T20:26:40.7284673Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:26:40.7284777Z #define __USE_POSIX2 1 2025-05-07T20:26:40.7284884Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:26:40.7284980Z #define __WALL 0x40000000 2025-05-07T20:26:40.7285092Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:26:40.7285183Z #define _XLOCALE_H 1 2025-05-07T20:26:40.7285285Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:26:40.7285396Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:26:40.7285497Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:26:40.7285615Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:26:40.7285719Z #define __EXCEPTIONS 1 2025-05-07T20:26:40.7285832Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:26:40.7286042Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:26:40.7286135Z #define __WORDSIZE 64 2025-05-07T20:26:40.7286235Z #define CLOCK_MONOTONIC 1 2025-05-07T20:26:40.7286338Z #define _STL_RELOPS_H 1 2025-05-07T20:26:40.7286439Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:26:40.7286545Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:26:40.7286657Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:26:40.7286757Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:26:40.7286863Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:26:40.7287186Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:26:40.7287430Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:40.7287560Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:26:40.7287678Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:26:40.7287789Z #define __cpp_range_based_for 201603L 2025-05-07T20:26:40.7287919Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:26:40.7288027Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:26:40.7288142Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:26:40.7288341Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:26:40.7288447Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:26:40.7288547Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:26:40.7288669Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:26:40.7288852Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:40.7288974Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:26:40.7289074Z #define _STRING_H 1 2025-05-07T20:26:40.7289180Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:26:40.7289281Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:26:40.7289386Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:26:40.7289536Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:26:40.7289643Z #define __code_model_small__ 1 2025-05-07T20:26:40.7289742Z #define _PSTL_CONFIG_H 2025-05-07T20:26:40.7289850Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:26:40.7289978Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:26:40.7290080Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:26:40.7290188Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:26:40.7290547Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:40.7290647Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:40.7290745Z #define le64toh(x) (x) 2025-05-07T20:26:40.7290845Z #define FILENAME_MAX 4096 2025-05-07T20:26:40.7291003Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:40.7291130Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:40.7291220Z #define L_cuserid 9 2025-05-07T20:26:40.7291436Z #define __ino_t_defined 2025-05-07T20:26:40.7291528Z #define __k8__ 1 2025-05-07T20:26:40.7291632Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:40.7291823Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:40.7291926Z #define __int8_t_defined 2025-05-07T20:26:40.7292025Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:40.7292132Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:40.7292259Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:40.7292363Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:40.7292498Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:40.7292655Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:40.7292748Z #define __HAVE_COLUMN 2025-05-07T20:26:40.7292851Z #define __stub_fdetach 2025-05-07T20:26:40.7293278Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:40.7293369Z #define __pic__ 2 2025-05-07T20:26:40.7293512Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:40.7293616Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:40.7293720Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:40.7293835Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:40.7293928Z #define __stub_chflags 2025-05-07T20:26:40.7294030Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:40.7294121Z #define __need_IOV_MAX 2025-05-07T20:26:40.7294236Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:40.7294355Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:40.7294460Z #define __cpp_decltype 200707L 2025-05-07T20:26:40.7294567Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:40.7294670Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:40.7294785Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:40.7294878Z #define TTY_NAME_MAX 32 2025-05-07T20:26:40.7295064Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:40.7295194Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7295380Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:40.7295506Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:40.7295614Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:40.7295720Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:40.7295810Z #define __import__ 2025-05-07T20:26:40.7295909Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:40.7296081Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:40.7296171Z #define __export__ 2025-05-07T20:26:40.7296298Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:40.7296414Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:40.7296584Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:40.7296688Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:40.7296793Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:40.7296899Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:40.7297003Z #define _WCHAR_T_DECLARED 2025-05-07T20:26:40.7297130Z #define 
__UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:26:40.7297262Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:26:40.7297382Z #define __cpp_inline_variables 201606L 2025-05-07T20:26:40.7297488Z #define WNOWAIT 0x01000000 2025-05-07T20:26:40.7297576Z #define PLOSS 6 2025-05-07T20:26:40.7297682Z #define M_LN10 2.30258509299404568402 2025-05-07T20:26:40.7297958Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:26:40.7298053Z #define EXIT_SUCCESS 0 2025-05-07T20:26:40.7298165Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:26:40.7298268Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:26:40.7298375Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:26:40.7298481Z #define __thread__ __thread 2025-05-07T20:26:40.7298585Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:26:40.7298690Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:26:40.7298801Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:26:40.7299040Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:40.7299267Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:26:40.7299369Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:26:40.7299529Z #define __linux__ 1 2025-05-07T20:26:40.7299640Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:26:40.7299774Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:26:40.7299874Z #define __S16_TYPE short int 2025-05-07T20:26:40.7300243Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:26:40.7300358Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:26:40.7300567Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:26:40.7300671Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:26:40.7300776Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:26:40.7300870Z #define _T_SIZE_ 2025-05-07T20:26:40.7300975Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:40.7301103Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:26:40.7301218Z #define _PSTL_VERSION 12000 2025-05-07T20:26:40.7301346Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:26:40.7301453Z #define __WNOTHREAD 0x20000000 2025-05-07T20:26:40.7301562Z #define _G_va_list __gnuc_va_list 2025-05-07T20:26:40.7301699Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:26:40.7301796Z #define _IOS_INPUT 1 2025-05-07T20:26:40.7301895Z #define __USE_LARGEFILE64 1 2025-05-07T20:26:40.7302007Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:26:40.7302112Z #define __INT64_TYPE__ long int 2025-05-07T20:26:40.7302215Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:26:40.7302325Z #define __shared__ __location__(shared) 2025-05-07T20:26:40.7302428Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:26:40.7302593Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:26:40.7302687Z #define __gid_t_defined 2025-05-07T20:26:40.7302813Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:26:40.7302922Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:26:40.7303130Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:26:40.7303248Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:26:40.7303346Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:26:40.7303445Z #define ___int_size_t_h 2025-05-07T20:26:40.7303559Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:40.7303690Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:26:40.7303863Z 
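Aside: the assert(expr) expansion a few entries back calls __assert_fail with __STRING(expr), __FILE__, __LINE__, and __ASSERT_FUNCTION only when the expression is false. A minimal sketch of the observable behavior:

#include <cassert>

int main() {
  int sum = 2 + 2;
  assert(sum == 4);    // true: reduces to the no-op cast, nothing is called
  // assert(sum == 5); // false: would call __assert_fail and abort;
                       // building with -DNDEBUG removes the check entirely
}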
#define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:40.7303974Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:40.7304076Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:40.7304191Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:40.7304291Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:40.7304422Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7304548Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:40.7304676Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:40.7304783Z #define __clock_t_defined 1 2025-05-07T20:26:40.7304891Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:40.7305007Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:40.7305117Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:40.7305216Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:40.7305320Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:40.7305441Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:40.7305538Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:40.7305717Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:40.7305809Z #define __SSE__ 1 2025-05-07T20:26:40.7305913Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:40.7306014Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:40.7306111Z #define _CTYPE_H 1 2025-05-07T20:26:40.7306210Z #define __sigset_t_defined 2025-05-07T20:26:40.7306319Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:40.7306421Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:40.7306515Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:40.7306719Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:40.7306819Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:40.7306983Z #define __SM_70_RT_H__ 2025-05-07T20:26:40.7307091Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:40.7307202Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:40.7307305Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:40.7307479Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:40.7307580Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:40.7307698Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:40.7307805Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:40.7307903Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:40.7307998Z #define __amd64__ 1 2025-05-07T20:26:40.7308094Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:40.7308205Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:40.7308492Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:40.7308604Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:40.7308693Z #define EOF (-1) 2025-05-07T20:26:40.7308803Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:40.7308908Z #define __USE_POSIX199309 1 2025-05-07T20:26:40.7309010Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:40.7309122Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:40.7309222Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:40.7309332Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:40.7309453Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:40.7309553Z #define ____mbstate_t_defined 1 2025-05-07T20:26:40.7309655Z #define STA_NANO 0x2000 2025-05-07T20:26:40.7309757Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:40.7309857Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:40.7309953Z #define _IO_LINKED 0x80 2025-05-07T20:26:40.7310081Z #define __cpp_lib_launder 201606 2025-05-07T20:26:40.7310179Z #define __SIZEOF_INT128__ 16 2025-05-07T20:26:40.7310288Z 
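Aside: __WIFCONTINUED at the start of this line and the neighboring __W* helpers are the internals behind the public wait-status macros in <sys/wait.h>. A hedged host-side decoding sketch:

#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  if (fork() == 0)
    _exit(7);                        // child terminates with code 7
  int status = 0;
  wait(&status);
  if (WIFEXITED(status))
    std::printf("exit code %d\n", WEXITSTATUS(status));   // prints 7
  else if (WIFSIGNALED(status))
    std::printf("killed by signal %d\n", WTERMSIG(status));
}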
#define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:40.7310400Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:40.7310502Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:40.7310653Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:40.7310780Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:40.7310889Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:40.7310997Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:40.7311098Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:40.7311195Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:40.7311343Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:40.7311473Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:40.7311685Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:40.7311887Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:40.7311981Z #define __stub_stty 2025-05-07T20:26:40.7312157Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:40.7312258Z #define le16toh(x) (x) 2025-05-07T20:26:40.7312378Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:40.7312562Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:40.7312662Z #define _SIZET_ 2025-05-07T20:26:40.7312761Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:40.7312861Z #define _SVID_SOURCE 1 2025-05-07T20:26:40.7312950Z #define _LP64 1 2025-05-07T20:26:40.7313048Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:40.7313300Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:40.7313420Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:40.7313515Z #define __UINT8_C(c) c 2025-05-07T20:26:40.7313762Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:40.7313863Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:40.7313979Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:40.7314086Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:40.7314186Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:40.7314296Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:40.7314480Z #define CUDARTAPI 2025-05-07T20:26:40.7314570Z #define IOV_MAX 1024 2025-05-07T20:26:40.7314803Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:40.7314908Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:40.7315010Z #define P_tmpdir "/tmp" 2025-05-07T20:26:40.7315125Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:40.7315213Z #define __wchar_t__ 2025-05-07T20:26:40.7315322Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:40.7315416Z #define SEEK_END 2 2025-05-07T20:26:40.7315515Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:40.7315695Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include(<tbb/tbb.h>) 2025-05-07T20:26:40.7315806Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:40.7315958Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:40.7316063Z #define ____FILE_defined 1 2025-05-07T20:26:40.7316189Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:40.7316291Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:40.7316397Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:40.7316499Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:40.7316767Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:40.7316915Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:40.7317004Z #define _IO_RIGHT 04 2025-05-07T20:26:40.7317106Z #define __END_NAMESPACE_STD 2025-05-07T20:26:40.7317305Z
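Aside: __CUDA_ARCH_LIST__ on this line is 520, matching the __CUDA_ARCH__ 520 seen earlier, so this nvcc pass compiles device code for sm_52 only. A minimal sketch of the per-architecture gating these macros support (the 800 threshold is illustrative, not from this build):

// __CUDA_ARCH__ is defined only while nvcc compiles device code, so this
// specializes the device path without affecting the host compilation pass.
__device__ int lanes_per_step() {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
  return 4;   // hypothetical wider path for sm_80 and newer
#else
  return 1;   // conservative path for sm_52 passes like this one
#endif
}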
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:40.7317406Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:40.7317539Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:40.7317642Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:40.7317750Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:40.7317845Z #define _STDDEF_H_ 2025-05-07T20:26:40.7318026Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:40.7318130Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:40.7318263Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:40.7318476Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:40.7318599Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7318758Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:40.7318890Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:40.7319006Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:40.7319122Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:40.7319222Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:40.7319349Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:40.7319452Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:40.7319552Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:40.7319664Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:40.7319845Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:40.7319945Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:40.7320140Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:40.7320252Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:40.7320353Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:40.7320518Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:40.7320620Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:40.7320725Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:40.7320831Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:40.7320958Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:40.7321065Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:40.7321174Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:40.7321348Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:40.7321533Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:40.7321639Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:40.7321766Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:40.7321893Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:40.7322129Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:40.7322451Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:40.7322556Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:40.7322679Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:40.7322786Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:40.7322882Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:40.7322982Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:40.7323093Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:40.7323196Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:40.7323285Z #define __FXSR__ 1 2025-05-07T20:26:40.7323379Z #define _SIZE_T 2025-05-07T20:26:40.7323491Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:40.7323617Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:26:40.7324070Z #define __FLT32X_MAX__ 
2025-05-07T20:26:40.7324306Z [compiler macro dump elided: several thousand predefined #define lines emitted by the g++/nvcc toolchain and the glibc/libstdc++/CUDA headers, including __NVCC__ 1, __CUDACC__ 1, and CUDART_VERSION 12080, confirming the CUDA 12.8 toolchain]
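A dump like the one above can be reproduced outside CI. A minimal sketch, assuming a GNU host compiler and nvcc on the PATH (the exact command the build script used to emit this dump is not shown in this log):

# Predefined macros of the host compiler alone:
g++ -dM -E -x c++ /dev/null | sort | head

# CUDA-specific macros (e.g. __CUDACC__, CUDART_VERSION) are only defined
# when nvcc processes .cu input; forwarding -dM to the host compiler
# approximates the dump seen above:
touch empty.cu
nvcc -E -Xcompiler -dM empty.cu | sort | head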
2025-05-07T20:26:40.7359264Z 2025-05-07T20:26:40.7529528Z 2025-05-07T20:26:40.7530002Z + conda run -n build_binary nvcc --version 2025-05-07T20:26:40.7530015Z 2025-05-07T20:26:42.6577450Z nvcc: NVIDIA (R) Cuda compiler driver 2025-05-07T20:26:42.6577862Z Copyright (c) 2005-2025 NVIDIA Corporation 2025-05-07T20:26:42.6578205Z Built on Wed_Jan_15_19:20:09_PST_2025 2025-05-07T20:26:42.6578540Z Cuda compilation tools, release 12.8, V12.8.61 2025-05-07T20:26:42.6578900Z Build cuda_12.8.r12.8/compiler.35404655_0 2025-05-07T20:26:42.6579120Z 2025-05-07T20:26:42.7252257Z 2025-05-07T20:26:42.7262898Z /usr/bin/nvidia-smi 2025-05-07T20:26:42.7268539Z + nvidia-smi 2025-05-07T20:26:42.7268693Z 2025-05-07T20:26:42.7444278Z Wed May 7 20:26:42 2025 2025-05-07T20:26:42.7444690Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:42.7445261Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:26:42.7445810Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:26:42.7446344Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:26:42.7446966Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:26:42.7447434Z | | | MIG M. | 2025-05-07T20:26:42.7447803Z |=========================================+========================+======================| 2025-05-07T20:26:42.7613062Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:26:42.7613545Z | 0% 27C P8 16W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:26:42.7614294Z | | | N/A | 2025-05-07T20:26:42.7614860Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:26:42.7617746Z 2025-05-07T20:26:42.7618181Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:42.7618658Z | Processes: | 2025-05-07T20:26:42.7619144Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:26:42.7619593Z | ID ID Usage | 2025-05-07T20:26:42.7619980Z |=========================================================================================| 2025-05-07T20:26:42.7622683Z | No running processes found | 2025-05-07T20:26:42.7623215Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:43.0119634Z 2025-05-07T20:26:43.0126007Z [INSTALL] Successfully installed CUDA 12.8.0 2025-05-07T20:26:43.0182068Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0 2025-05-07T20:26:43.0182691Z . 
$PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0 2025-05-07T20:26:43.0196177Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:26:43.0196580Z env: 2025-05-07T20:26:43.0196846Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:26:43.0197194Z BUILD_ENV: build_binary 2025-05-07T20:26:43.0197483Z BUILD_TARGET: genai 2025-05-07T20:26:43.0197756Z BUILD_VARIANT: cuda 2025-05-07T20:26:43.0198026Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:26:43.0198327Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:26:43.0198679Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:26:43.0199077Z ##[endgroup] 2025-05-07T20:26:43.3924326Z ################################################################################ 2025-05-07T20:26:43.3924888Z # Install PyTorch (PIP) 2025-05-07T20:26:43.3925263Z # 2025-05-07T20:26:43.3942436Z # [2025-05-07T20:26:43.393Z] + install_pytorch_pip build_binary nightly cuda/12.8.0 2025-05-07T20:26:43.3943138Z ################################################################################ 2025-05-07T20:26:43.3943468Z 2025-05-07T20:26:43.3972949Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy 2025-05-07T20:26:44.5079952Z Channels: 2025-05-07T20:26:44.5080231Z - conda-forge 2025-05-07T20:26:44.5080497Z Platform: linux-64 2025-05-07T20:26:47.8817110Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:26:48.6051672Z Solving environment: \ | / done 2025-05-07T20:26:48.8281083Z 2025-05-07T20:26:48.8281355Z ## Package Plan ## 2025-05-07T20:26:48.8281562Z 2025-05-07T20:26:48.8281781Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:26:48.8282101Z 2025-05-07T20:26:48.8282211Z added / updated specs: 2025-05-07T20:26:48.8282465Z - numpy 2025-05-07T20:26:48.8282597Z 2025-05-07T20:26:48.8282614Z 2025-05-07T20:26:48.8282741Z The following packages will be downloaded: 2025-05-07T20:26:48.8282967Z 2025-05-07T20:26:48.8283100Z package | build 2025-05-07T20:26:48.8283432Z ---------------------------|----------------- 2025-05-07T20:26:48.8283836Z libblas-3.9.0 |31_h59b9bed_openblas 16 KB conda-forge 2025-05-07T20:26:48.8284316Z libcblas-3.9.0 |31_he106b2a_openblas 16 KB conda-forge 2025-05-07T20:26:48.8284793Z libgfortran-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:26:48.8285262Z libgfortran5-15.1.0 | hcea5267_2 1.5 MB conda-forge 2025-05-07T20:26:48.8285743Z liblapack-3.9.0 |31_h7ac8fdf_openblas 16 KB conda-forge 2025-05-07T20:26:48.8286609Z libopenblas-0.3.29 |pthreads_h94d23a6_0 5.6 MB conda-forge 2025-05-07T20:26:48.8287083Z numpy-2.2.5 | py310hefbff90_0 7.6 MB conda-forge 2025-05-07T20:26:48.8287493Z ------------------------------------------------------------ 2025-05-07T20:26:48.8287852Z Total: 14.8 MB 2025-05-07T20:26:48.8288071Z 2025-05-07T20:26:48.8288216Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:48.8288446Z 2025-05-07T20:26:48.8288681Z libblas conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas 2025-05-07T20:26:48.8289201Z libcblas conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas 2025-05-07T20:26:48.8289729Z libgfortran conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2 2025-05-07T20:26:48.8290255Z libgfortran5 conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2 2025-05-07T20:26:48.8290801Z liblapack conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas 2025-05-07T20:26:48.8291362Z libopenblas conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0 2025-05-07T20:26:48.8292102Z numpy 
conda-forge/linux-64::numpy-2.2.5-py310hefbff90_0
2025-05-07T20:26:48.8292574Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:49.1217625Z libblas-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:26:49.1271276Z libgfortran-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:26:49.1406766Z libcblas-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:26:49.1917515Z liblapack-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:26:49.3922440Z libgfortran5-15.1.0 | 1.5 MB | ########## | 100%
2025-05-07T20:26:49.7561984Z libopenblas-0.3.29 | 5.6 MB | ########## | 100%
2025-05-07T20:26:49.7569095Z numpy-2.2.5 | 7.6 MB | ########## | 100%
2025-05-07T20:26:49.7573207Z done
2025-05-07T20:26:49.8579639Z Preparing transaction: done
2025-05-07T20:26:49.9584337Z Verifying transaction: done
2025-05-07T20:26:50.0593607Z Executing transaction: done
2025-05-07T20:26:50.2430316Z ################################################################################
2025-05-07T20:26:50.2430801Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:50.2431125Z #
2025-05-07T20:26:50.2446244Z # [2025-05-07T20:26:50.244Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:26:50.2446859Z ################################################################################
2025-05-07T20:26:50.2461967Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:50.3403160Z [CHECK] Network does not appear to be blocked.
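The [EXEC] [ATTEMPT 0/3] prefix indicates that each external command is wrapped in a retry helper defined in setup_env.bash. A minimal sketch of such a wrapper; the name exec_with_retries and the backoff policy are stand-ins, not the script's actual implementation:

# Hypothetical retry wrapper matching the "[EXEC] [ATTEMPT n/3]" log lines.
exec_with_retries () {
  local max_retries=3
  for ((attempt = 0; attempt <= max_retries; attempt++)); do
    echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
    if "$@"; then
      return 0
    fi
    sleep $((2 ** attempt))  # back off before the next attempt (assumed policy)
  done
  return 1
}

# Usage, as with the network probe above:
exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null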
2025-05-07T20:26:50.3403690Z ################################################################################ 2025-05-07T20:26:50.3404183Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:26:50.3404608Z # 2025-05-07T20:26:50.3421716Z # [2025-05-07T20:26:50.341Z] + __prepare_pip_arguments torch nightly cuda/12.8.0 2025-05-07T20:26:50.3422544Z ################################################################################ 2025-05-07T20:26:50.3422806Z 2025-05-07T20:26:50.3445584Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:26:50.3473101Z [INSTALL] Extracted package variant: cu128 2025-05-07T20:26:50.3490721Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:26:50.3491301Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:26:50.3500303Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:26:50.3510133Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ... 2025-05-07T20:26:50.3533445Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:29.3009850Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:29.3010486Z Collecting torch 2025-05-07T20:28:29.3011260Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:29.3012005Z Collecting filelock (from torch) 2025-05-07T20:28:29.3012530Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:29.3013506Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from torch) (4.13.2) 2025-05-07T20:28:29.3014350Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:29.3014870Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:29.3015743Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 185.3 MB/s eta 0:00:00 2025-05-07T20:28:29.3016128Z Collecting networkx (from torch) 2025-05-07T20:28:29.3016653Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:28:29.3017329Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 127.8 MB/s eta 0:00:00 2025-05-07T20:28:29.3017701Z Collecting jinja2 (from torch) 2025-05-07T20:28:29.3018204Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:29.3018726Z Collecting fsspec (from torch) 2025-05-07T20:28:29.3019249Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:28:29.3019853Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch) 2025-05-07T20:28:29.3020723Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:29.3021587Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch) 2025-05-07T20:28:29.3022961Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:29.3024136Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch) 2025-05-07T20:28:29.3025002Z Downloading 
https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:29.3025828Z Collecting nvidia-cudnn-cu12==9.8.0.87 (from torch) 2025-05-07T20:28:29.3026565Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB) 2025-05-07T20:28:29.3027331Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch) 2025-05-07T20:28:29.3028120Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:29.3028967Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch) 2025-05-07T20:28:29.3029790Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:29.3030836Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch) 2025-05-07T20:28:29.3031587Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:29.3032337Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch) 2025-05-07T20:28:29.3033101Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:29.3033960Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch) 2025-05-07T20:28:29.3034796Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:29.3035642Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:29.3036404Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB) 2025-05-07T20:28:29.3037139Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:29.3037927Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:29.3038722Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch) 2025-05-07T20:28:29.3039519Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:29.3040335Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch) 2025-05-07T20:28:29.3041164Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:29.3042000Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch) 2025-05-07T20:28:29.3042827Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:29.3043667Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:29.3044525Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:29.3045841Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1) 
2025-05-07T20:28:29.3046732Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:29.3047308Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:29.3048219Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 5.2 MB/s eta 0:00:00 2025-05-07T20:28:29.3048614Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:29.3049359Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB) 2025-05-07T20:28:29.3050455Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp310-cp310-manylinux_2_28_x86_64.whl (1047.1 MB) 2025-05-07T20:28:29.3051281Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 GB 20.7 MB/s eta 0:00:00 2025-05-07T20:28:29.3052006Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB) 2025-05-07T20:28:29.3052811Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 609.6/609.6 MB 50.8 MB/s eta 0:00:00 2025-05-07T20:28:29.3053615Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB) 2025-05-07T20:28:29.3054514Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 160.2 MB/s eta 0:00:00 2025-05-07T20:28:29.3055438Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB) 2025-05-07T20:28:29.3056361Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.0/88.0 MB 135.1 MB/s eta 0:00:00 2025-05-07T20:28:29.3057167Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB) 2025-05-07T20:28:29.3058109Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 954.8/954.8 kB 88.2 MB/s eta 0:00:00 2025-05-07T20:28:29.3058814Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB) 2025-05-07T20:28:29.3059610Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 698.0/698.0 MB 41.8 MB/s eta 0:00:00 2025-05-07T20:28:29.3060414Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB) 2025-05-07T20:28:29.3061297Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.1/193.1 MB 115.3 MB/s eta 0:00:00 2025-05-07T20:28:29.3062088Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB) 2025-05-07T20:28:29.3062953Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 62.2 MB/s eta 0:00:00 2025-05-07T20:28:29.3063664Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 MB) 2025-05-07T20:28:29.3064454Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.6/63.6 MB 149.9 MB/s eta 0:00:00 2025-05-07T20:28:29.3065174Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB) 2025-05-07T20:28:29.3066090Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 260.4/260.4 MB 118.2 MB/s eta 0:00:00 2025-05-07T20:28:29.3066903Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB) 
2025-05-07T20:28:29.3067786Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 292.1/292.1 MB 106.9 MB/s eta 0:00:00 2025-05-07T20:28:29.3068508Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:28:29.3069313Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 136.2 MB/s eta 0:00:00 2025-05-07T20:28:29.3070094Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:28:29.3070948Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 131.4 MB/s eta 0:00:00 2025-05-07T20:28:29.3071762Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 MB) 2025-05-07T20:28:29.3072758Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.2/39.2 MB 162.3 MB/s eta 0:00:00 2025-05-07T20:28:29.3073638Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB) 2025-05-07T20:28:29.3074817Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.4 MB) 2025-05-07T20:28:29.3075721Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.4/153.4 MB 128.7 MB/s eta 0:00:00 2025-05-07T20:28:29.3077494Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:28:29.3079315Z 2025-05-07T20:28:29.3081346Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128 2025-05-07T20:28:29.3083445Z 2025-05-07T20:28:31.5405337Z torch 2.8.0.dev20250507+cu128 2025-05-07T20:28:31.5407915Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128) 2025-05-07T20:28:35.2935714Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:38.8390216Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128 2025-05-07T20:28:38.8390758Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:42.3805474Z True 2025-05-07T20:28:42.3805738Z True 2025-05-07T20:28:42.3805853Z 2025-05-07T20:28:42.4426064Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:42.4464459Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:42.4465099Z if . 
$PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:42.4478262Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:42.4478631Z env: 2025-05-07T20:28:42.4478876Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:42.4479192Z BUILD_ENV: build_binary 2025-05-07T20:28:42.4479460Z BUILD_TARGET: genai 2025-05-07T20:28:42.4479884Z BUILD_VARIANT: cuda 2025-05-07T20:28:42.4480134Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:42.4480411Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:42.4480735Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:42.4481082Z ##[endgroup] 2025-05-07T20:28:42.7877068Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:42.7879142Z ################################################################################ 2025-05-07T20:28:42.7879826Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:42.7880357Z # 2025-05-07T20:28:42.7894671Z # [2025-05-07T20:28:42.789Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:42.7895244Z ################################################################################ 2025-05-07T20:28:42.7895554Z 2025-05-07T20:28:42.7910218Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:42.8839021Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:42.8848849Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:42.8849523Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:42.8849933Z 2025-05-07T20:28:42.9774458Z 2025-05-07T20:28:42.9774919Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:42.9798922Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:49.2327717Z Collecting environment information... 
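Before the full report below, a minimal sketch of spot-checking the installation by hand, equivalent in spirit to the [CHECK] lines above (environment name as in this log; this is not the script's actual implementation):

# Confirm wheel version, CUDA variant, C++11 ABI, and sub-package presence.
conda run -n build_binary python -c "
import torch
import torch.distributed                 # sub-package presence check
print(torch.__version__)                 # expect a ...+cu128 build
print(torch.version.cuda)                # expect 12.8
print(torch.compiled_with_cxx11_abi())   # expect True
print(torch.cuda.is_available())         # requires a visible GPU
"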
2025-05-07T20:28:49.2328320Z PyTorch version: 2.8.0.dev20250507+cu128 2025-05-07T20:28:49.2328654Z Is debug build: False 2025-05-07T20:28:49.2328913Z CUDA used to build PyTorch: 12.8 2025-05-07T20:28:49.2329213Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:49.2329395Z 2025-05-07T20:28:49.2329526Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:49.2329870Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:49.2330335Z Clang version: Could not collect 2025-05-07T20:28:49.2330761Z CMake version: Could not collect 2025-05-07T20:28:49.2331100Z Libc version: glibc-2.34 2025-05-07T20:28:49.2331268Z 2025-05-07T20:28:49.2331707Z Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:49.2332591Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:49.2333150Z Is CUDA available: True 2025-05-07T20:28:49.2333424Z CUDA runtime version: 12.8.61 2025-05-07T20:28:49.2333704Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:49.2334034Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:49.2334380Z Nvidia driver version: 570.133.07 2025-05-07T20:28:49.2334676Z cuDNN version: Could not collect 2025-05-07T20:28:49.2334955Z HIP runtime version: N/A 2025-05-07T20:28:49.2335221Z MIOpen runtime version: N/A 2025-05-07T20:28:49.2335499Z Is XNNPACK available: True 2025-05-07T20:28:49.2335668Z 2025-05-07T20:28:49.2335751Z CPU: 2025-05-07T20:28:49.2335980Z Architecture: x86_64 2025-05-07T20:28:49.2336327Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:49.2336727Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:49.2337124Z Byte Order: Little Endian 2025-05-07T20:28:49.2337451Z CPU(s): 16 2025-05-07T20:28:49.2337758Z On-line CPU(s) list: 0-15 2025-05-07T20:28:49.2338484Z Vendor ID: AuthenticAMD 2025-05-07T20:28:49.2338842Z Model name: AMD EPYC 7R32 2025-05-07T20:28:49.2339174Z CPU family: 23 2025-05-07T20:28:49.2339464Z Model: 49 2025-05-07T20:28:49.2339762Z Thread(s) per core: 2 2025-05-07T20:28:49.2340061Z Core(s) per socket: 8 2025-05-07T20:28:49.2340350Z Socket(s): 1 2025-05-07T20:28:49.2340643Z Stepping: 0 2025-05-07T20:28:49.2341111Z BogoMIPS: 5600.08 2025-05-07T20:28:49.2343252Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:49.2345372Z Hypervisor vendor: KVM 2025-05-07T20:28:49.2345698Z Virtualization type: full 2025-05-07T20:28:49.2346044Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:49.2346425Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:49.2346808Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:49.2347173Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:49.2347506Z NUMA node(s): 1 2025-05-07T20:28:49.2347814Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:49.2348157Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:49.2348549Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:49.2348924Z Vulnerability L1tf: Not affected 2025-05-07T20:28:49.2349290Z 
Vulnerability Mds: Not affected 2025-05-07T20:28:49.2349650Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:49.2350022Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:49.2350405Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:49.2350965Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:49.2351573Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:49.2352138Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:49.2352853Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:49.2353909Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:49.2354605Z Vulnerability Srbds: Not affected 2025-05-07T20:28:49.2354980Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:49.2355218Z 2025-05-07T20:28:49.2355325Z Versions of relevant libraries: 2025-05-07T20:28:49.2355606Z [pip3] numpy==2.2.5 2025-05-07T20:28:49.2355860Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T20:28:49.2356177Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T20:28:49.2356492Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T20:28:49.2356824Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T20:28:49.2357149Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T20:28:49.2357443Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T20:28:49.2357741Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T20:28:49.2358050Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T20:28:49.2358358Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T20:28:49.2358809Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:49.2359121Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:49.2359410Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T20:28:49.2359718Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T20:28:49.2360018Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:49.2360325Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T20:28:49.2360711Z [conda] cuda-cudart 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:49.2361206Z [conda] cuda-cudart-dev 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:49.2361811Z [conda] cuda-cudart-dev_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:49.2362341Z [conda] cuda-cudart-static 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:49.2362887Z [conda] cuda-cudart-static_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:49.2363429Z [conda] cuda-cudart_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:49.2363930Z [conda] cuda-cupti 12.8.57 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2364403Z [conda] cuda-cupti-dev 12.8.57 h5888daf_0 conda-forge 2025-05-07T20:28:49.2364896Z [conda] cuda-libraries 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:49.2365407Z [conda] cuda-libraries-dev 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:49.2365896Z [conda] cuda-nvrtc 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2366379Z [conda] cuda-nvrtc-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:49.2366851Z [conda] cuda-nvtx 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2367318Z [conda] cuda-opencl 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2367803Z [conda] cuda-opencl-dev 12.8.55 h5888daf_0 conda-forge 2025-05-07T20:28:49.2368296Z [conda] cuda-runtime 12.8.0 ha804496_0 conda-forge 2025-05-07T20:28:49.2368775Z [conda] libcublas 12.8.3.14 h9ab20c4_0 
conda-forge 2025-05-07T20:28:49.2369252Z [conda] libcublas-dev 12.8.3.14 h9ab20c4_0 conda-forge 2025-05-07T20:28:49.2369733Z [conda] libcufft 11.3.3.41 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2370206Z [conda] libcufft-dev 11.3.3.41 h5888daf_0 conda-forge 2025-05-07T20:28:49.2370682Z [conda] libcurand 10.3.9.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2371161Z [conda] libcurand-dev 10.3.9.55 h5888daf_0 conda-forge 2025-05-07T20:28:49.2371651Z [conda] libcusolver 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:49.2372149Z [conda] libcusolver-dev 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:49.2372647Z [conda] libcusparse 12.5.7.53 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2373145Z [conda] libcusparse-dev 12.5.7.53 h5888daf_0 conda-forge 2025-05-07T20:28:49.2373646Z [conda] libnvjitlink 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2374148Z [conda] libnvjitlink-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:49.2374620Z [conda] numpy 2.2.5 py310hefbff90_0 conda-forge 2025-05-07T20:28:49.2375097Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T20:28:49.2375612Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:49.2376125Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:49.2376636Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:49.2377142Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T20:28:49.2377741Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T20:28:49.2378226Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T20:28:49.2378723Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T20:28:49.2379233Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T20:28:49.2379752Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:49.2380245Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:49.2380822Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:49.2381313Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T20:28:49.2381795Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:49.2382269Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T20:28:49.2382555Z 2025-05-07T20:28:49.3085385Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:49.3086082Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:49.3098207Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:49.3098570Z env: 2025-05-07T20:28:49.3098811Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:49.3099123Z BUILD_ENV: build_binary 2025-05-07T20:28:49.3099384Z BUILD_TARGET: genai 2025-05-07T20:28:49.3099627Z BUILD_VARIANT: cuda 2025-05-07T20:28:49.3099886Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:49.3100158Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:49.3100477Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:49.3100819Z ##[endgroup] 2025-05-07T20:28:49.6503704Z ################################################################################ 2025-05-07T20:28:49.6504107Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:49.6504361Z # 2025-05-07T20:28:49.6524806Z # [2025-05-07T20:28:49.652Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:49.6525244Z ################################################################################ 2025-05-07T20:28:49.6535551Z 2025-05-07T20:28:49.6542525Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:49.7464965Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:49.7484653Z [BUILD] Running git submodules update ... 2025-05-07T20:28:49.7503941Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:49.7869852Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:49.7870344Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:49.7870808Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:49.7871220Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:49.7871652Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:49.7872113Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:49.7872541Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:49.7905664Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:49.8462248Z [BUILD] Installing other build dependencies ... 
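The submodule refresh above is the standard two-step pattern; a minimal sketch of running the same steps in a local FBGEMM checkout:

cd FBGEMM
git submodule sync                        # re-read submodule URLs from .gitmodules
git submodule update --init --recursive   # check out pinned commits, including nested submodules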
2025-05-07T20:28:49.8484541Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:52.2857811Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:52.3243451Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:52.4293677Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:52.4333957Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:52.6758390Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:52.6792926Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:52.7868491Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:52.7895161Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:53.1657621Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:53.1715962Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:53.2263618Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:53.2267651Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:53.3067611Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:53.3117998Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:53.3522442Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:53.4061039Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:53.4106723Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:53.5433616Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:53.5462589Z Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:53.6515872Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:53.6552744Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:53.7083701Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:53.7737393Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:53.7775758Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:53.8774033Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:53.8800589Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:53.9895786Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:53.9932293Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:54.1013323Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:54.1056402Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:54.2035276Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:28:54.2063400Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:54.3098591Z Collecting tomli>=1.1.0 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:54.3174679Z Downloading tomli-2.2.1-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:54.4244914Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:54.4273648Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:54.5618319Z Collecting exceptiongroup>=1.0.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:54.5646818Z Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB) 2025-05-07T20:28:54.6623661Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:54.6652803Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:54.7157647Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:54.7678444Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:54.7708660Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:54.8200698Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:54.8738885Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:54.8768384Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:54.9268695Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:54.9909664Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:54.9935827Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:55.0446175Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:55.0932252Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:55.1460035Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:55.6659298Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 53.6 MB/s eta 0:00:00 2025-05-07T20:28:55.6700694Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:55.7213237Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:55.7708110Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:55.8250538Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:55.8910745Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:55.9429581Z Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB) 2025-05-07T20:28:56.0058335Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 751.2/751.2 kB 8.1 MB/s eta 0:00:00 2025-05-07T20:28:56.0097124Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:56.0621819Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:56.1102748Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:56.1581055Z 
Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:56.2159239Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:56.2642803Z Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB) 2025-05-07T20:28:56.3074837Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:56.3592343Z Downloading tomli-2.2.1-py3-none-any.whl (14 kB) 2025-05-07T20:28:56.4121805Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:56.4603540Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:56.5134870Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:56.5654433Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:56.8031199Z Installing collected packages: sortedcontainers, tomli, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, exceptiongroup, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:59.1755912Z 2025-05-07T20:28:59.1830474Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 exceptiongroup-1.2.2 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 tomli-2.2.1 typing-inspect-0.9.0 2025-05-07T20:28:59.3649962Z ################################################################################ 2025-05-07T20:28:59.3650508Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:59.3650896Z # 2025-05-07T20:28:59.3667391Z # [2025-05-07T20:28:59.366Z] + install_triton_pip build_binary 2025-05-07T20:28:59.3667987Z ################################################################################ 2025-05-07T20:28:59.3668341Z 2025-05-07T20:28:59.3668704Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:59.3669362Z ################################################################################ 2025-05-07T20:28:59.3669898Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:59.3670359Z # 2025-05-07T20:28:59.3688278Z # [2025-05-07T20:28:59.368Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:59.3688828Z ################################################################################ 2025-05-07T20:28:59.3689055Z 2025-05-07T20:28:59.3706510Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:59.4648164Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:59.4648973Z ################################################################################ 2025-05-07T20:28:59.4649451Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:59.4649844Z # 2025-05-07T20:28:59.4668708Z # [2025-05-07T20:28:59.466Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:59.4669224Z ################################################################################ 2025-05-07T20:28:59.4669509Z 2025-05-07T20:28:59.4716414Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:59.4733516Z [INSTALL] Using a non-RELEASE channel: nightly ... 
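[NOTE] A minimal Python sketch of how a spec like "nightly/3.2.0+git4b3bb1f8" appears to expand into pip arguments, based on the [INSTALL] lines around it. This is an assumed reconstruction for illustration only; the real logic lives in .github/scripts/setup_env.bash.

def prepare_pip_arguments(package: str, spec: str) -> list[str]:
    # Split "<channel>/<version>"; a bare version implies the release channel.
    channel, _, version = spec.rpartition("/")
    channel = channel or "release"
    base = "https://download.pytorch.org/whl/"
    index_url = base if channel == "release" else f"{base}{channel}/"
    args = ["pip", "install"]
    if channel != "release":
        args.append("--pre")  # nightly/test channels ship pre-release wheels
    args += [f"{package}=={version}", "--index-url", index_url]
    return args

print(prepare_pip_arguments("pytorch-triton", "nightly/3.2.0+git4b3bb1f8"))
# ['pip', 'install', '--pre', 'pytorch-triton==3.2.0+git4b3bb1f8',
#  '--index-url', 'https://download.pytorch.org/whl/nightly/']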
2025-05-07T20:28:59.4734041Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:59.4742330Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:59.4751825Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:59.4772814Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:05.3193409Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 2025-05-07T20:29:05.3194683Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:05.3195365Z 2025-05-07T20:29:05.3195583Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:05.3196017Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:05.3196842Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:05.3198107Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:05.3199213Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 174.1 MB/s eta 0:00:00 2025-05-07T20:29:05.3199617Z Installing collected packages: pytorch-triton 2025-05-07T20:29:05.3199971Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:05.3200380Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:05.3200823Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:05.3201264Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:05.3201722Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:05.3201989Z 2025-05-07T20:29:07.5651365Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:07.5655259Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:09.8700349Z ################################################################################ 2025-05-07T20:29:09.8700862Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:09.8701276Z ################################################################################ 2025-05-07T20:29:09.8701528Z 2025-05-07T20:29:12.0479971Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:14.1904018Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:14.1907428Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:14.1945460Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:14.1946343Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:14.1959097Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:14.1959574Z env: 2025-05-07T20:29:14.1959904Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:14.1960395Z BUILD_ENV: build_binary 2025-05-07T20:29:14.1960746Z BUILD_TARGET: genai 2025-05-07T20:29:14.1961059Z BUILD_VARIANT: cuda 2025-05-07T20:29:14.1961551Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:14.1961915Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:14.1962385Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:14.1962811Z ##[endgroup] 2025-05-07T20:29:14.5406129Z ################################################################################ 2025-05-07T20:29:14.5406670Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:14.5407016Z # 2025-05-07T20:29:14.5423221Z # [2025-05-07T20:29:14.541Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5424338Z ################################################################################ 2025-05-07T20:29:14.5424621Z 2025-05-07T20:29:14.5425117Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5425963Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5426382Z 2025-05-07T20:29:14.5587624Z 2bed2d996c113b97194d809bcd57307f8de8d387 fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5590975Z 2025-05-07T20:29:14.5591757Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5592194Z 2025-05-07T20:29:14.5770727Z 4888273ec0852f505fccc81faa23a2d37bf7d3b8624276cf783c626cc6938b65 fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5773772Z 2025-05-07T20:29:14.5774666Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5775240Z 2025-05-07T20:29:14.6119206Z 8884054067b6c5891f141d668bcfc919 fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.6122929Z 2025-05-07T20:29:14.6134685Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:14.6157501Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:17.6395764Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:17.6396951Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:17.6397978Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:17.6398786Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:17.6399137Z 2025-05-07T20:29:24.8613133Z ################################################################################ 2025-05-07T20:29:24.8613813Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:24.8614376Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128 2025-05-07T20:29:24.8614973Z [CHECK] CUDA version reported by PyTorch is: 12.8 2025-05-07T20:29:24.8615471Z [CHECK] 2025-05-07T20:29:24.8615913Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:24.8616563Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:29:24.8617140Z ################################################################################ 2025-05-07T20:29:24.8617408Z 2025-05-07T20:29:24.8617596Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:29:29.0232342Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:33.0780010Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:37.2341482Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:37.2344578Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:49.5263742Z ################################################################################ 2025-05-07T20:29:49.5264511Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:49.5264994Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:49.5265477Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:49.5266019Z ################################################################################ 2025-05-07T20:29:49.5266334Z 2025-05-07T20:29:57.7266216Z ################################################################################ 2025-05-07T20:29:57.7267450Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:57.7270308Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:57.7273565Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:57.7274237Z ################################################################################ 2025-05-07T20:29:57.7274558Z 2025-05-07T20:29:57.7274798Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:30:01.8114798Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:30:05.9026075Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:30:10.1065925Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:14.2368261Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:14.2372395Z [INSTALL] Check for operator registrations ... 
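[NOTE] The operator-registration check that follows reduces to looking each operator up on the torch.ops namespace after importing fbgemm_gpu; a minimal sketch of the idea (hypothetical helper, not the actual setup_env.bash code):

import torch
import fbgemm_gpu  # noqa: F401  -- the import loads the libraries that register the ops

def check_operator_registered(op_name: str) -> None:
    # Attribute lookup on torch.ops.fbgemm raises AttributeError if no
    # loaded library has registered an operator under that name.
    getattr(torch.ops.fbgemm, op_name)
    print(f"[CHECK] FBGEMM_GPU operator appears to be correctly registered: "
          f"torch.ops.fbgemm.{op_name}")

for name in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
    check_operator_registered(name)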
2025-05-07T20:30:18.2108820Z fbgemm.nccl_init 2025-05-07T20:30:18.2109037Z 2025-05-07T20:30:18.2747240Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:22.3698908Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:22.3699199Z 2025-05-07T20:30:22.4344772Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:26.4110366Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:26.4111176Z 2025-05-07T20:30:26.4741340Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:26.4742094Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:26.4794186Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:26.4794704Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:26.4812055Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:26.4812479Z env: 2025-05-07T20:30:26.4812730Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:26.4813060Z BUILD_ENV: build_binary 2025-05-07T20:30:26.4813320Z BUILD_TARGET: genai 2025-05-07T20:30:26.4813570Z BUILD_VARIANT: cuda 2025-05-07T20:30:26.4813828Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:30:26.4814099Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:26.4814426Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:26.4814787Z ##[endgroup] 2025-05-07T20:30:26.8249085Z ################################################################################ 2025-05-07T20:30:26.8249462Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:26.8249819Z # 2025-05-07T20:30:26.8264569Z # [2025-05-07T20:30:26.826Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:26.8265002Z ################################################################################ 2025-05-07T20:30:26.8265227Z 2025-05-07T20:30:35.0138671Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:35.0139838Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:35.0140293Z [TEST] Determined the test directories: 2025-05-07T20:30:35.0140654Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:35.0140998Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:35.0141332Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:35.0141552Z 2025-05-07T20:30:35.0147609Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:35.0154569Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:35.0155213Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:35.0155535Z 2025-05-07T20:30:35.4767844Z 2025-05-07T20:30:35.4768149Z [TEST] Installing PyTest ... 
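[NOTE] The "[EXEC] [ATTEMPT n/3]" lines here and earlier come from retrying network-bound commands; a minimal Python rendering of the pattern for illustration (the actual retry helper is a bash function in setup_env.bash):

import subprocess
import time

def exec_with_retries(cmd: list[str], max_retries: int = 3, delay_s: int = 5) -> None:
    # Re-run the command on failure, up to max_retries extra attempts,
    # sleeping briefly between attempts to ride out transient failures.
    for attempt in range(max_retries + 1):
        print(f"[EXEC] [ATTEMPT {attempt}/{max_retries}] + {' '.join(cmd)}")
        try:
            subprocess.run(cmd, check=True)
            return
        except subprocess.CalledProcessError:
            if attempt == max_retries:
                raise
            time.sleep(delay_s)

exec_with_retries(["conda", "install", "-n", "build_binary", "-c", "conda-forge",
                   "--override-channels", "-y", "pytest", "expecttest"])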
2025-05-07T20:30:35.4797299Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:30:36.6657437Z Channels:
2025-05-07T20:30:36.6657778Z  - conda-forge
2025-05-07T20:30:36.6658146Z Platform: linux-64
2025-05-07T20:30:40.1118950Z Collecting package metadata (repodata.json): done
2025-05-07T20:30:41.2665383Z Solving environment: done
2025-05-07T20:30:41.4994563Z ## Package Plan ##
2025-05-07T20:30:41.4995156Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:30:41.4995599Z   added / updated specs:
2025-05-07T20:30:41.4995868Z     - expecttest
2025-05-07T20:30:41.4996105Z     - pytest
2025-05-07T20:30:41.4996437Z The following packages will be downloaded:
2025-05-07T20:30:41.4996882Z     package                    |            build
2025-05-07T20:30:41.4997367Z     ---------------------------|-----------------
2025-05-07T20:30:41.4997921Z     colorama-0.4.6             |     pyhd8ed1ab_1          26 KB  conda-forge
2025-05-07T20:30:41.4998498Z     exceptiongroup-1.2.2       |     pyhd8ed1ab_1          20 KB  conda-forge
2025-05-07T20:30:41.4998993Z     expecttest-0.3.0           |     pyhd8ed1ab_0          14 KB  conda-forge
2025-05-07T20:30:41.4999467Z     iniconfig-2.0.0            |     pyhd8ed1ab_1          11 KB  conda-forge
2025-05-07T20:30:41.4999935Z     packaging-25.0             |     pyh29332c3_1          61 KB  conda-forge
2025-05-07T20:30:41.5000382Z     pluggy-1.5.0               |     pyhd8ed1ab_1          23 KB  conda-forge
2025-05-07T20:30:41.5000823Z     pytest-8.3.5               |     pyhd8ed1ab_0         254 KB  conda-forge
2025-05-07T20:30:41.5001648Z     tomli-2.2.1                |     pyhd8ed1ab_1          19 KB  conda-forge
2025-05-07T20:30:41.5002158Z     ------------------------------------------------------------
2025-05-07T20:30:41.5002589Z                                            Total:         428 KB
2025-05-07T20:30:41.5003088Z The following NEW packages will be INSTALLED:
2025-05-07T20:30:41.5003543Z   colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:30:41.5004080Z   exceptiongroup     conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:30:41.5004628Z   expecttest         conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:30:41.5005135Z   iniconfig          conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:30:41.5005632Z   packaging          conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:30:41.5006104Z   pluggy             conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:30:41.5006572Z   pytest             conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:30:41.5007024Z   tomli              conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:30:41.5007473Z Downloading and Extracting Packages: ...working...
2025-05-07T20:30:41.6816410Z exceptiongroup-1.2.2 | 20 KB  | ########## | 100%
2025-05-07T20:30:41.8112630Z tomli-2.2.1          | 19 KB  | ########## | 100%
2025-05-07T20:30:41.8118303Z expecttest-0.3.0     | 14 KB  | ########## | 100%
2025-05-07T20:30:41.8805752Z colorama-0.4.6       | 26 KB  | ########## | 100%
2025-05-07T20:30:41.8864414Z iniconfig-2.0.0      | 11 KB  | ########## | 100%
2025-05-07T20:30:41.9069303Z pytest-8.3.5         | 254 KB | ########## | 100%
2025-05-07T20:30:41.9488672Z pluggy-1.5.0         | 23 KB  | ########## | 100%
2025-05-07T20:30:41.9550705Z packaging-25.0       | 61 KB  | ########## | 100%
2025-05-07T20:30:41.9995136Z done
2025-05-07T20:30:42.1002559Z Preparing transaction: done
2025-05-07T20:30:42.2004374Z Verifying transaction: done
2025-05-07T20:30:44.1034511Z Executing transaction: done
2025-05-07T20:30:44.2399329Z [TEST] Checking imports ...
2025-05-07T20:30:48.4029067Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
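[NOTE] The "[CHECK] Python (sub-)package ... found" lines amount to an import probe run inside the conda environment; a minimal sketch (hypothetical helper, not the script's actual code):

import importlib

def check_python_package(name: str) -> bool:
    # A real import both finds the package and exercises its import-time
    # dependencies, so it is a stronger check than a path lookup.
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    print(f"[CHECK] Python (sub-)package '{name}' found ...")
    return True

for pkg in ("fbgemm_gpu", "fbgemm_gpu.config", "fbgemm_gpu.docs"):
    check_python_package(pkg)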
2025-05-07T20:30:48.4041182Z [TEST] Setting feature flags ... 2025-05-07T20:30:48.4041635Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:48.4041979Z 2025-05-07T20:30:48.8267672Z 2025-05-07T20:30:48.8269206Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:48.8270041Z ################################################################################ 2025-05-07T20:30:48.8270380Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:48.8270637Z # 2025-05-07T20:30:48.8290223Z # [2025-05-07T20:30:48.828Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:48.8290826Z ################################################################################ 2025-05-07T20:30:48.8291057Z 2025-05-07T20:30:48.8298330Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:48.8328343Z ./attention/gqa_test.py 2025-05-07T20:30:48.8328759Z ./coalesce/coalesce_test.py 2025-05-07T20:30:48.8329068Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:48.8329366Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:48.8329674Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:48.8329943Z ./moe/activation_test.py 2025-05-07T20:30:48.8330208Z ./moe/gather_scatter_test.py 2025-05-07T20:30:48.8330466Z ./moe/layers_test.py 2025-05-07T20:30:48.8330707Z ./moe/shuffling_test.py 2025-05-07T20:30:48.8330970Z ./quantize/quantize_test.py 2025-05-07T20:30:48.8331141Z 2025-05-07T20:30:48.8331267Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:48.8331485Z 2025-05-07T20:30:48.8350595Z ################################################################################ 2025-05-07T20:30:48.8366850Z # [2025-05-07T20:30:48.836Z] Run Python Test Suite: 2025-05-07T20:30:48.8367206Z # ./attention/gqa_test.py 2025-05-07T20:30:48.8367495Z ################################################################################ 2025-05-07T20:30:48.8391990Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:48.8392871Z 2025-05-07T20:30:51.4582763Z ============================= test session starts ============================== 2025-05-07T20:30:51.4583991Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:51.4585446Z cachedir: .pytest_cache 2025-05-07T20:30:51.4586589Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:51.4588016Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:51.4588832Z plugins: hypothesis-6.131.14 2025-05-07T20:30:53.1146436Z collecting ... 
collected 2 items

2025-05-07T20:31:30.7769317Z attention/gqa_test.py::Int4GQATest::test_gqa
Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=23, MAX_T=33, N_H_L=68)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=52, N_H_L=67)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=57, MAX_T=45, N_H_L=120)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=52, MAX_T=42, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=77, MAX_T=95, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=113, MAX_T=48, N_H_L=96)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=51, MAX_T=61, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=113, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=65, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=108, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=14, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=6)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=70, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=78, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=94)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=41, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=105)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=95, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=43, N_H_L=43)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=21, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=42, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=74, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=104, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=69, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=69, N_H_L=69)
2025-05-07T20:31:30.7872540Z PASSED
2025-05-07T20:31:30.8151759Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
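[NOTE] The "Trying example" blocks above are Hypothesis printing each generated parameter set under the 'ci' profile shown in the session header (database=None, deadline=None, print_blob=True, derandomize=True). A minimal sketch of registering such a profile, assuming the settings the header reports:

from hypothesis import HealthCheck, settings

# Mirror the profile the pytest session header prints: no example database,
# no per-example deadline, deterministic (derandomized) generation, and the
# too_slow health check suppressed for slow CI runners.
settings.register_profile(
    "ci",
    database=None,
    deadline=None,
    print_blob=True,
    derandomize=True,
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")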
2025-05-07T20:31:30.8152145Z 2025-05-07T20:31:30.8152322Z =========================== short test summary info ============================ 2025-05-07T20:31:30.8153327Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:30.8154184Z ======================== 1 passed, 1 skipped in 39.91s ========================= 2025-05-07T20:31:31.4980622Z 2025-05-07T20:31:31.4981504Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:31.5004634Z [TEST] Python test time for ./attention/gqa_test.py: 43 seconds 2025-05-07T20:31:31.5005059Z 2025-05-07T20:31:31.5005063Z 2025-05-07T20:31:31.5005068Z 2025-05-07T20:31:31.5005072Z 2025-05-07T20:31:31.5027290Z ################################################################################ 2025-05-07T20:31:31.5044132Z # [2025-05-07T20:31:31.504Z] Run Python Test Suite: 2025-05-07T20:31:31.5044553Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:31.5045010Z ################################################################################ 2025-05-07T20:31:31.5072177Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:31.5072873Z 2025-05-07T20:31:33.8148117Z ============================= test session starts ============================== 2025-05-07T20:31:33.8148997Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:33.8149979Z cachedir: .pytest_cache 2025-05-07T20:31:33.8150637Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:33.8151447Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:33.8151908Z plugins: hypothesis-6.131.14 2025-05-07T20:31:35.4322938Z collecting ... 
collected 1 item 2025-05-07T20:31:35.4323261Z 2025-05-07T20:31:36.1698325Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:36.1698802Z 2025-05-07T20:31:36.1699018Z ============================== 1 passed in 2.49s =============================== 2025-05-07T20:31:36.7994457Z 2025-05-07T20:31:36.7995126Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:36.8015541Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:31:36.8015867Z 2025-05-07T20:31:36.8015872Z 2025-05-07T20:31:36.8015876Z 2025-05-07T20:31:36.8015880Z 2025-05-07T20:31:36.8036264Z ################################################################################ 2025-05-07T20:31:36.8050860Z # [2025-05-07T20:31:36.804Z] Run Python Test Suite: 2025-05-07T20:31:36.8052998Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:36.8053456Z ################################################################################ 2025-05-07T20:31:36.8079131Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:36.8079772Z 2025-05-07T20:31:39.0456437Z ============================= test session starts ============================== 2025-05-07T20:31:39.0457204Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:39.0457795Z cachedir: .pytest_cache 2025-05-07T20:31:39.0458461Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:39.0459285Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:39.0459749Z plugins: hypothesis-6.131.14 2025-05-07T20:31:40.7445514Z collecting ... 
collected 5 items 2025-05-07T20:31:40.7445861Z 2025-05-07T20:31:40.7456957Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:40.7465731Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:40.7473782Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:40.7482313Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:40.7498419Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:40.7498932Z 2025-05-07T20:31:40.7499115Z =========================== short test summary info ============================ 2025-05-07T20:31:40.7499885Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.7500919Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.7501946Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.7502967Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.7503987Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.7504758Z ============================== 5 skipped in 1.84s ============================== 2025-05-07T20:31:41.3406142Z 2025-05-07T20:31:41.3407065Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:41.3428301Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds 2025-05-07T20:31:41.3428653Z 2025-05-07T20:31:41.3428658Z 2025-05-07T20:31:41.3428662Z 2025-05-07T20:31:41.3428666Z 2025-05-07T20:31:41.3452519Z ################################################################################ 2025-05-07T20:31:41.3469302Z # [2025-05-07T20:31:41.346Z] Run Python Test Suite: 2025-05-07T20:31:41.3469690Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:41.3470038Z ################################################################################ 2025-05-07T20:31:41.3495848Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:41.3496578Z 2025-05-07T20:31:43.5647051Z ============================= test session starts ============================== 2025-05-07T20:31:43.5647720Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:43.5648268Z cachedir: .pytest_cache 2025-05-07T20:31:43.5648883Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:43.5649638Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:43.5650074Z plugins: hypothesis-6.131.14 2025-05-07T20:31:45.2417642Z collecting ... 
collected 2 items 2025-05-07T20:31:45.2417928Z 2025-05-07T20:31:45.2429543Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:45.2444050Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:45.2444510Z 2025-05-07T20:31:45.2444669Z =========================== short test summary info ============================ 2025-05-07T20:31:45.2445345Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:45.2446220Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:45.2446846Z ============================== 2 skipped in 1.80s ============================== 2025-05-07T20:31:45.8417132Z 2025-05-07T20:31:45.8417793Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:45.8439764Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds 2025-05-07T20:31:45.8440128Z 2025-05-07T20:31:45.8440133Z 2025-05-07T20:31:45.8440158Z 2025-05-07T20:31:45.8440568Z 2025-05-07T20:31:45.8464438Z ################################################################################ 2025-05-07T20:31:45.8481969Z # [2025-05-07T20:31:45.847Z] Run Python Test Suite: 2025-05-07T20:31:45.8482362Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:45.8482690Z ################################################################################ 2025-05-07T20:31:45.8509538Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:45.8510218Z 2025-05-07T20:31:48.1650038Z ============================= test session starts ============================== 2025-05-07T20:31:48.1650739Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:48.1651319Z cachedir: .pytest_cache 2025-05-07T20:31:48.1651983Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:48.1652789Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:48.1653238Z plugins: hypothesis-6.131.14 2025-05-07T20:31:49.8498467Z collecting ... collected 4 items 2025-05-07T20:31:49.8498929Z 2025-05-07T20:31:52.7574664Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
2025-05-07T20:31:52.7708076Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:52.7866940Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:52.8001076Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:52.8001539Z 2025-05-07T20:31:52.8001707Z =========================== short test summary info ============================ 2025-05-07T20:31:52.8002462Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:52.8003422Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when xformers is not available 2025-05-07T20:31:52.8004073Z ============================== 4 skipped in 4.77s ============================== 2025-05-07T20:31:54.7236350Z 2025-05-07T20:31:54.7237077Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:54.7259416Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds 2025-05-07T20:31:54.7259751Z 2025-05-07T20:31:54.7259764Z 2025-05-07T20:31:54.7259769Z 2025-05-07T20:31:54.7259773Z 2025-05-07T20:31:54.7282167Z ################################################################################ 2025-05-07T20:31:54.7299118Z # [2025-05-07T20:31:54.729Z] Run Python Test Suite: 2025-05-07T20:31:54.7299501Z # ./moe/activation_test.py 2025-05-07T20:31:54.7299829Z ################################################################################ 2025-05-07T20:31:54.7328312Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:54.7328993Z 2025-05-07T20:31:57.0591067Z ============================= test session starts ============================== 2025-05-07T20:31:57.0591810Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:57.0592393Z cachedir: .pytest_cache 2025-05-07T20:31:57.0593044Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:57.0593950Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:57.0594405Z plugins: hypothesis-6.131.14 2025-05-07T20:31:58.8020434Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:58.9843083Z collecting ... 
2025-05-07T20:31:54.7282167Z ################################################################################
2025-05-07T20:31:54.7299118Z # [2025-05-07T20:31:54.729Z] Run Python Test Suite:
2025-05-07T20:31:54.7299501Z # ./moe/activation_test.py
2025-05-07T20:31:54.7299829Z ################################################################################
2025-05-07T20:31:54.7328312Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py
2025-05-07T20:31:57.0591067Z ============================= test session starts ==============================
2025-05-07T20:31:57.0591810Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:57.0592393Z cachedir: .pytest_cache
2025-05-07T20:31:57.0593044Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:57.0593950Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:57.0594405Z plugins: hypothesis-6.131.14
2025-05-07T20:31:58.8020434Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:31:58.9843083Z collecting ... collected 2 items
2025-05-07T20:32:04.6527334Z moe/activation_test.py::ActivationTests::test_silu_mul
2025-05-07T20:32:04.6529967Z Trying 40 examples: test_silu_mul(T, D, contiguous, compiled) over every combination of T in {1, 128, 2048, 4096, 16384}, D in {5120, 7168}, contiguous in {True, False}, compiled in {True, False}
2025-05-07T20:32:04.6620742Z PASSED
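The 40 examples tried above are exactly the cross-product of the sampled_from domains in the test's @given decorator (visible in the failure report below); with derandomize=True, Hypothesis replays the same fixed sequence on every run. A quick offline sanity check of the count:

    from itertools import product

    # 5 values of T x 2 of D x 2 of contiguous x 2 of compiled = 40 examples.
    combos = list(product([1, 128, 2048, 4096, 16384],  # T
                          [5120, 7168],                 # D
                          [True, False],                # contiguous
                          [True, False]))               # compiled
    assert len(combos) == 40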
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.7259337Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:32:04.7260659Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:04.7261801Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:32:04.7263167Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.7264595Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.7265853Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:04.7267029Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:32:04.7268349Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.7269862Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.7271053Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.7272079Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.7272919Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:32:04.7274160Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:04.7421809Z (the same identify_mutated_tensors warning, with an identical traceback, was logged three more times, at 20:32:04.741, 20:32:04.786, and 20:32:04.790)
2025-05-07T20:32:05.2770325Z moe/activation_test.py::ActivationTests::test_silu_mul_quant
2025-05-07T20:32:05.2770325Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:05.2773725Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
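The reference path fails the same way because fbgemm's triton_quantize_fp8_row itself launches a Triton kernel (_kernel_quantize_fp8_row). Conceptually, rowwise FP8 quantization computes one scale per row and casts. A rough PyTorch-only sketch of that idea, not FBGEMM's implementation (the 448.0 E4M3 bound, the eps clamp, and the exact scale_ub handling are assumptions; scale_ub is taken as a 1-element float32 tensor, as in the test). It matches the dequantization used in the test, y ≈ y_fp8.float() * scale[:, None]:

    import torch

    E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

    def rowwise_quantize_fp8(y, scale_ub=None):
        # One scale per row, chosen so the row's max magnitude maps to E4M3_MAX.
        row_max = y.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / E4M3_MAX
        y_fp8 = (y.float() / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)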
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.8703965Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8705530Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.8706910Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:05.8708385Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.8709718Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:05.8710858Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:05.8711980Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:05.8713325Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.8714865Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.8716102Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:05.8717255Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:05.8718570Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.8720075Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.8721256Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8722266Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8723080Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:05.8724467Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:06.0679815Z (the same warning was logged three more times, at 20:32:06.064, 20:32:06.604, and 20:32:06.637)
2025-05-07T20:32:07.3445918Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
(test source as above; the failure is at:)
2025-05-07T20:32:07.3456981Z
2025-05-07T20:32:07.3457225Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:07.3457720Z             op = silu_mul_quant
2025-05-07T20:32:07.3457990Z             if compiled:
2025-05-07T20:32:07.3458252Z                 op = torch.compile(op)
2025-05-07T20:32:07.3458572Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:07.3458871Z
2025-05-07T20:32:07.3459074Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:07.3459257Z
2025-05-07T20:32:07.3459363Z moe/activation_test.py:117:
2025-05-07T20:32:07.3459720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3460072Z moe/activation_test.py:115: in fn
2025-05-07T20:32:07.3460382Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:07.3461123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:07.3461849Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:07.3462418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.3463371Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.3464101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.3464671Z     kernel = self.compile(
2025-05-07T20:32:07.3465247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.3465945Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.3466362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3466613Z
2025-05-07T20:32:07.3466839Z self =
2025-05-07T20:32:07.3467982Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.3469443Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a53d89d0>}
2025-05-07T20:32:07.3470860Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.3471952Z context =
2025-05-07T20:32:07.3472296Z
2025-05-07T20:32:07.3472473Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.3473127Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.3473737Z                            module_map=module_map)
2025-05-07T20:32:07.3474129Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.3474501Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:07.3474776Z E       ^
2025-05-07T20:32:07.3475260Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.3475741Z
2025-05-07T20:32:07.3476178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:07.3476713Z
2025-05-07T20:32:07.3476830Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.3477265Z     self=,
2025-05-07T20:32:07.3477684Z     T=2048,
2025-05-07T20:32:07.3477886Z     D=5120,
2025-05-07T20:32:07.3478099Z     scale_ub=1200.0,
2025-05-07T20:32:07.3478338Z     contiguous=True,
2025-05-07T20:32:07.3478578Z     compiled=True,
2025-05-07T20:32:07.3478803Z )
2025-05-07T20:32:07.3479139Z self =
2025-05-07T20:32:07.3479820Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:07.3480148Z
2025-05-07T20:32:07.3480241Z     @given(
2025-05-07T20:32:07.3480498Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:07.3480860Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:07.3481216Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:07.3481598Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:07.3481976Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:07.3482315Z     )
2025-05-07T20:32:07.3482730Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:07.3483261Z     def test_silu_mul_quant(
2025-05-07T20:32:07.3483538Z         self,
2025-05-07T20:32:07.3483755Z         T: int,
2025-05-07T20:32:07.3483968Z         D: int,
2025-05-07T20:32:07.3484214Z         scale_ub: Optional[float],
2025-05-07T20:32:07.3484528Z         contiguous: bool,
2025-05-07T20:32:07.3484780Z         compiled: bool,
2025-05-07T20:32:07.3485022Z     ) -> None:
2025-05-07T20:32:07.3485255Z         torch.manual_seed(2025)
2025-05-07T20:32:07.3485507Z
2025-05-07T20:32:07.3485797Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:07.3486159Z
2025-05-07T20:32:07.3486362Z         x_sign = torch.sign(x)
2025-05-07T20:32:07.3486674Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:07.3487004Z         x = x_sign * x_clamp
2025-05-07T20:32:07.3487263Z         x0 = x[:, :D]
2025-05-07T20:32:07.3487489Z         x1 = x[:, D:]
2025-05-07T20:32:07.3487715Z
2025-05-07T20:32:07.3487917Z         if contiguous:
2025-05-07T20:32:07.3488164Z             x0 = x0.contiguous()
2025-05-07T20:32:07.3488441Z             x1 = x1.contiguous()
2025-05-07T20:32:07.3488700Z
2025-05-07T20:32:07.3488904Z         if scale_ub is not None:
2025-05-07T20:32:07.3489196Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:07.3489560Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:07.3489882Z             )
2025-05-07T20:32:07.3490094Z         else:
2025-05-07T20:32:07.3490323Z             scale_ub_tensor = None
2025-05-07T20:32:07.3490588Z
2025-05-07T20:32:07.3490839Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:07.3491172Z             op = silu_mul_quant
2025-05-07T20:32:07.3491432Z             if compiled:
2025-05-07T20:32:07.3491701Z                 op = torch.compile(op)
2025-05-07T20:32:07.3492019Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:07.3492316Z
2025-05-07T20:32:07.3492518Z         y_fp8, y_scale = fn()
2025-05-07T20:32:07.3492915Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:07.3493227Z
2025-05-07T20:32:07.3493477Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:07.3493832Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:07.3494150Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:07.3494479Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:07.3494861Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:07.3495192Z
2025-05-07T20:32:07.3495405Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:07.3495618Z
2025-05-07T20:32:07.3495726Z moe/activation_test.py:126:
2025-05-07T20:32:07.3496046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3496401Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:07.3496747Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:07.3497578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:07.3498369Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:07.3498941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.3499794Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.3500518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:07.3501280Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:07.3502068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:07.3502858Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:07.3503628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:07.3504301Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:07.3504930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:07.3505480Z     fn()
2025-05-07T20:32:07.3506016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:07.3506621Z     self.fn.run(
2025-05-07T20:32:07.3507114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.3507674Z     kernel = self.compile(
2025-05-07T20:32:07.3508242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.3508925Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.3509350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3509590Z
2025-05-07T20:32:07.3509812Z self =
2025-05-07T20:32:07.3510939Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.3512369Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a536f1c0>}
2025-05-07T20:32:07.3513831Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.3514907Z context =
2025-05-07T20:32:07.3515292Z
2025-05-07T20:32:07.3515479Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.3516032Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.3516540Z                            module_map=module_map)
2025-05-07T20:32:07.3516926Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.3517302Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:07.3517581Z E       ^
2025-05-07T20:32:07.3518073Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.3518544Z
2025-05-07T20:32:07.3518998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:07.3519534Z
2025-05-07T20:32:07.3519650Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.3520092Z     self=,
2025-05-07T20:32:07.3520521Z     T=16384,
2025-05-07T20:32:07.3520734Z     D=7168,
2025-05-07T20:32:07.3520939Z     scale_ub=1200.0,
2025-05-07T20:32:07.3521187Z     contiguous=False,
2025-05-07T20:32:07.3521519Z     compiled=False,
2025-05-07T20:32:07.3521732Z )
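Every failure above has the same root cause: Triton's NVIDIA backend refuses to lower the fp8e4nv element type (the Triton type backing torch.float8_e4m3fn) on this runner's GPU, offering only fp8e4b15 and fp8e5. A minimal sketch of a capability guard such a suite could use, assuming the sm_89 (Ada/Hopper) cutoff implied by the error message; the helper and class names are hypothetical, not FBGEMM code:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # get_device_capability() returns e.g. (8, 6) for pre-Ada parts;
        # the (8, 9) minimum for fp8e4nv is an assumption from the error above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv requires sm_89 or newer")
    class Fp8ActivationTests(unittest.TestCase):  # hypothetical container
        ...

With such a guard, these cases would report as skipped on older GPUs instead of failing repeatedly at Triton compile time.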
2025-05-07T20:32:09.7728932Z self =
2025-05-07T20:32:09.7729558Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:09.7730425Z
2025-05-07T20:32:09.7730548Z     @given(
2025-05-07T20:32:09.7730886Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:09.7731325Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:09.7731760Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:09.7732186Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:09.7732539Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:09.7732857Z     )
2025-05-07T20:32:09.7733243Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:09.7733728Z     def test_silu_mul_quant(
2025-05-07T20:32:09.7733995Z         self,
2025-05-07T20:32:09.7734212Z         T: int,
2025-05-07T20:32:09.7734446Z         D: int,
2025-05-07T20:32:09.7734683Z         scale_ub: Optional[float],
2025-05-07T20:32:09.7734984Z         contiguous: bool,
2025-05-07T20:32:09.7735250Z         compiled: bool,
2025-05-07T20:32:09.7735506Z     ) -> None:
2025-05-07T20:32:09.7735747Z         torch.manual_seed(2025)
2025-05-07T20:32:09.7736017Z
2025-05-07T20:32:09.7736313Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:09.7736688Z
2025-05-07T20:32:09.7736903Z         x_sign = torch.sign(x)
2025-05-07T20:32:09.7737214Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:09.7737556Z         x = x_sign * x_clamp
2025-05-07T20:32:09.7737823Z         x0 = x[:, :D]
2025-05-07T20:32:09.7738062Z         x1 = x[:, D:]
2025-05-07T20:32:09.7738296Z
2025-05-07T20:32:09.7738501Z         if contiguous:
2025-05-07T20:32:09.7738762Z             x0 = x0.contiguous()
2025-05-07T20:32:09.7739043Z             x1 = x1.contiguous()
2025-05-07T20:32:09.7739316Z
2025-05-07T20:32:09.7739529Z         if scale_ub is not None:
2025-05-07T20:32:09.7739823Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:09.7740188Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:09.7740533Z             )
2025-05-07T20:32:09.7740742Z         else:
2025-05-07T20:32:09.7740977Z             scale_ub_tensor = None
2025-05-07T20:32:09.7741253Z
2025-05-07T20:32:09.7741504Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:09.7741853Z             op = silu_mul_quant
2025-05-07T20:32:09.7742129Z             if compiled:
2025-05-07T20:32:09.7742395Z                 op = torch.compile(op)
2025-05-07T20:32:09.7742718Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:09.7743023Z
2025-05-07T20:32:09.7743233Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:09.7743419Z
2025-05-07T20:32:09.7743530Z moe/activation_test.py:117:
2025-05-07T20:32:09.7744020Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:09.7744386Z moe/activation_test.py:115: in fn
2025-05-07T20:32:09.7744690Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:09.7745441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:09.7746197Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:09.7746774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:09.7747516Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:09.7748232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:09.7748810Z     kernel = self.compile(
2025-05-07T20:32:09.7749391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:09.7750106Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:09.7750539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:09.7750785Z
2025-05-07T20:32:09.7751102Z self =
2025-05-07T20:32:09.7752256Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:09.7756071Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a4cb7ac0>}
2025-05-07T20:32:09.7757506Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:09.7758592Z context =
2025-05-07T20:32:09.7758901Z
2025-05-07T20:32:09.7759090Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:09.7759651Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:09.7760157Z                            module_map=module_map)
2025-05-07T20:32:09.7760556Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.7760938Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.7761215Z E       ^
2025-05-07T20:32:09.7761716Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.7762195Z
2025-05-07T20:32:09.7762644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.7763188Z
2025-05-07T20:32:09.7763310Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.7763747Z     self=,
2025-05-07T20:32:09.7764177Z     T=1,
2025-05-07T20:32:09.7764380Z     D=7168,
2025-05-07T20:32:09.7764592Z     scale_ub=None,
2025-05-07T20:32:09.7764824Z     contiguous=True,
2025-05-07T20:32:09.7765067Z     compiled=True,
2025-05-07T20:32:09.7765289Z )
2025-05-07T20:32:09.7765633Z self =
2025-05-07T20:32:09.7766153Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:09.7766430Z
2025-05-07T20:32:09.7766514Z     @given(
2025-05-07T20:32:09.7766767Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:09.7767130Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:09.7767454Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:09.7767813Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:09.7768277Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:09.7768585Z     )
2025-05-07T20:32:09.7768967Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:09.7769445Z     def test_silu_mul_quant(
2025-05-07T20:32:09.7769715Z         self,
2025-05-07T20:32:09.7769924Z         T: int,
2025-05-07T20:32:09.7770144Z         D: int,
2025-05-07T20:32:09.7770384Z         scale_ub: Optional[float],
2025-05-07T20:32:09.7770673Z         contiguous: bool,
2025-05-07T20:32:09.7770937Z         compiled: bool,
2025-05-07T20:32:09.7771184Z     ) -> None:
2025-05-07T20:32:09.7771413Z         torch.manual_seed(2025)
2025-05-07T20:32:09.7771679Z
2025-05-07T20:32:09.7771978Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:09.7772340Z
2025-05-07T20:32:09.7772554Z         x_sign = torch.sign(x)
2025-05-07T20:32:09.7772870Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:09.7773204Z         x = x_sign * x_clamp
2025-05-07T20:32:09.7773489Z         x0 = x[:, :D]
2025-05-07T20:32:09.7773755Z         x1 = x[:, D:]
2025-05-07T20:32:09.7774002Z
2025-05-07T20:32:09.7774206Z         if contiguous:
2025-05-07T20:32:09.7774463Z             x0 = x0.contiguous()
2025-05-07T20:32:09.7774829Z             x1 = x1.contiguous()
2025-05-07T20:32:09.7775094Z
2025-05-07T20:32:09.7775309Z         if scale_ub is not None:
2025-05-07T20:32:09.7775607Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:09.7775968Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:09.7776308Z             )
2025-05-07T20:32:09.7776520Z         else:
2025-05-07T20:32:09.7776747Z             scale_ub_tensor = None
2025-05-07T20:32:09.7777026Z
2025-05-07T20:32:09.7777281Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:09.7777622Z             op = silu_mul_quant
2025-05-07T20:32:09.7777896Z             if compiled:
2025-05-07T20:32:09.7778172Z                 op = torch.compile(op)
2025-05-07T20:32:09.7778491Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:09.7778822Z
2025-05-07T20:32:09.7779037Z         y_fp8, y_scale = fn()
2025-05-07T20:32:09.7779354Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:09.7779675Z
2025-05-07T20:32:09.7779937Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:09.7780302Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:09.7780616Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:09.7780956Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:09.7781349Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:09.7781683Z
2025-05-07T20:32:09.7781909Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:09.7782132Z
2025-05-07T20:32:09.7782242Z moe/activation_test.py:126:
2025-05-07T20:32:09.7782567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:09.7782930Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:09.7783294Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:09.7784143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:09.7784955Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:09.7785547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:09.7786286Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:09.7787029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:09.7787804Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:09.7788709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:09.7789518Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:09.7790307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:09.7791006Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:09.7791660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:09.7792222Z     fn()
2025-05-07T20:32:09.7792765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:09.7793395Z     self.fn.run(
2025-05-07T20:32:09.7794027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:09.7794604Z     kernel = self.compile(
2025-05-07T20:32:09.7795188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:09.7795895Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:09.7796326Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:09.7796694Z
2025-05-07T20:32:09.7796922Z self =
2025-05-07T20:32:09.7798075Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:09.7799550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76943d30a0>}
2025-05-07T20:32:09.7800990Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:09.7802088Z context =
2025-05-07T20:32:09.7802403Z
2025-05-07T20:32:09.7802588Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:09.7803202Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:09.7803710Z                            module_map=module_map)
2025-05-07T20:32:09.7804108Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.7804490Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:09.7804779Z E       ^
2025-05-07T20:32:09.7805282Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.7805765Z
2025-05-07T20:32:09.7806214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.7806772Z
2025-05-07T20:32:09.7806886Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.7807333Z     self=,
2025-05-07T20:32:09.7807780Z     T=4096,
2025-05-07T20:32:09.7807982Z     D=5120,
2025-05-07T20:32:09.7808196Z     scale_ub=None,
2025-05-07T20:32:09.7808437Z     contiguous=False,
2025-05-07T20:32:09.7808679Z     compiled=False,
2025-05-07T20:32:09.7808903Z )
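The reference path dies the same way: triton_quantize_fp8_row autotunes _kernel_quantize_fp8_row, which also materializes fp8e4nv values, so both the kernel under test and its reference fail before any numerics run. What the reference computes is ordinary row-wise fp8 quantization; a rough pure-PyTorch sketch under stated assumptions (the function name and the 448.0 E4M3 maximum are assumptions, not FBGEMM's implementation):

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite torch.float8_e4m3fn value


    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute maximum, optionally capped by scale_ub as in the test.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        scale = FP8_E4M3_MAX / torch.clamp(row_max, min=1e-12)  # avoid div by zero
        y_fp8 = (y.to(torch.float32) * scale[:, None]).to(torch.float8_e4m3fn)
        # The test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None],
        # so return the inverse scale.
        return y_fp8, scale.reciprocal()

Note that this sketch still produces float8_e4m3fn tensors, so it, too, only runs on hardware whose backend accepts that dtype.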
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.8232198Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:11.8233407Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:32:11.8234985Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:11.8236591Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:11.8238159Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:11.8239721Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.8248310Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:11.8250079Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.8251699Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:11.8253112Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:32:11.8254498Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:11.8255877Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:11.8257061Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:11.8258221Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:11.8259730Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:11.8261181Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:11.8262449Z W0507 
20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:11.8263645Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:11.8265026Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:11.8266562Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:11.8267763Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.8268797Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.8269650Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:11.8270806Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.4241847Z self = 2025-05-07T20:32:15.4242524Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:15.4242829Z 2025-05-07T20:32:15.4242920Z @given( 2025-05-07T20:32:15.4243190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.4243532Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.4243870Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.4244231Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.4244581Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.4244899Z ) 2025-05-07T20:32:15.4245650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.4246132Z def test_silu_mul_quant( 2025-05-07T20:32:15.4246392Z self, 2025-05-07T20:32:15.4246613Z T: int, 2025-05-07T20:32:15.4246843Z D: int, 2025-05-07T20:32:15.4247077Z scale_ub: Optional[float], 2025-05-07T20:32:15.4247376Z contiguous: bool, 2025-05-07T20:32:15.4247643Z compiled: bool, 2025-05-07T20:32:15.4247888Z ) -> None: 2025-05-07T20:32:15.4248129Z torch.manual_seed(2025) 2025-05-07T20:32:15.4248398Z 2025-05-07T20:32:15.4248691Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.4249068Z 2025-05-07T20:32:15.4249281Z x_sign = torch.sign(x) 2025-05-07T20:32:15.4249592Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.4249927Z x = x_sign * x_clamp 2025-05-07T20:32:15.4250189Z x0 = x[:, :D] 2025-05-07T20:32:15.4250416Z x1 = x[:, D:] 2025-05-07T20:32:15.4250648Z 2025-05-07T20:32:15.4250851Z if contiguous: 2025-05-07T20:32:15.4251096Z x0 = x0.contiguous() 2025-05-07T20:32:15.4251374Z x1 = x1.contiguous() 2025-05-07T20:32:15.4251807Z 2025-05-07T20:32:15.4252009Z if scale_ub is not None: 2025-05-07T20:32:15.4252303Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.4252663Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.4252986Z ) 2025-05-07T20:32:15.4253197Z else: 2025-05-07T20:32:15.4253425Z scale_ub_tensor = None 
2025-05-07T20:32:15.4253690Z 
2025-05-07T20:32:15.4253941Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:15.4254280Z             op = silu_mul_quant
2025-05-07T20:32:15.4254545Z             if compiled:
2025-05-07T20:32:15.4254812Z                 op = torch.compile(op)
2025-05-07T20:32:15.4255132Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:15.4255426Z 
2025-05-07T20:32:15.4255632Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:15.4255814Z 
2025-05-07T20:32:15.4255927Z moe/activation_test.py:117: 
2025-05-07T20:32:15.4256245Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:15.4256597Z moe/activation_test.py:115: in fn
2025-05-07T20:32:15.4256899Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:15.4257634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:15.4258360Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:15.4258931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:15.4259657Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:15.4260358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:15.4260920Z     kernel = self.compile(
2025-05-07T20:32:15.4261495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:15.4262192Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:15.4262621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:15.4262862Z 
2025-05-07T20:32:15.4263080Z self = 
2025-05-07T20:32:15.4264226Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:15.4265712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a7370c10>}
2025-05-07T20:32:15.4267200Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:15.4268272Z context = 
2025-05-07T20:32:15.4268580Z 
2025-05-07T20:32:15.4268754Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:15.4269298Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:15.4269787Z             module_map=module_map)
2025-05-07T20:32:15.4270166Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:15.4270536Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:15.4270811Z E       ^
2025-05-07T20:32:15.4271291Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.4271775Z 
2025-05-07T20:32:15.4272210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:15.4272752Z 
2025-05-07T20:32:15.4272861Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:15.4273379Z     self=,
2025-05-07T20:32:15.4273873Z     T=4096,
2025-05-07T20:32:15.4274073Z     D=7168,
2025-05-07T20:32:15.4274282Z     scale_ub=None,
2025-05-07T20:32:15.4274508Z     contiguous=False,
2025-05-07T20:32:15.4274749Z     compiled=False,
2025-05-07T20:32:15.4274973Z )
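Every example hypothesis draws for test_silu_mul_quant dies at the same point: Triton cannot lower the fp8e4nv (e4m3) cast on this runner. fp8e4nv needs an NVIDIA GPU of compute capability 8.9 or newer (Ada/Hopper); the linux.g5.4xlarge runner carries an A10G at 8.6, which only offers fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal guard sketch, assuming a unittest-style suite; supports_fp8e4nv is an illustrative helper, not fbgemm_gpu API:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Triton emits fp8e4nv (e4m3) conversions only on SM 8.9+ (Ada/Hopper);
        # the A10G on this runner reports (8, 6), so this returns False here.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantGuardedTests(unittest.TestCase):
        def test_capability_guard(self) -> None:
            self.assertTrue(supports_fp8e4nv())

Skipping at class level would also keep hypothesis from repeatedly drawing examples that can only fail on pre-Ada parts.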
2025-05-07T20:32:15.4302642Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:15.4303008Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:15.4303284Z E       ^
2025-05-07T20:32:15.4303773Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.4304252Z 
2025-05-07T20:32:15.4304739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:15.4305279Z 
2025-05-07T20:32:15.4305389Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:15.4305824Z     self=,
2025-05-07T20:32:15.4306245Z     T=128,
2025-05-07T20:32:15.4306438Z     D=7168,
2025-05-07T20:32:15.4306647Z     scale_ub=None,
2025-05-07T20:32:15.4306876Z     contiguous=False,
2025-05-07T20:32:15.4307111Z     compiled=True,
2025-05-07T20:32:15.4307327Z )
2025-05-07T20:32:15.4988161Z self = 
2025-05-07T20:32:15.4989274Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:15.4989838Z 
2025-05-07T20:32:15.5004237Z         y_fp8, y_scale = fn()
2025-05-07T20:32:15.5004587Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:15.5004892Z 
2025-05-07T20:32:15.5005144Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:15.5005501Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:15.5005807Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:15.5006138Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:15.5006514Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:15.5006845Z 
2025-05-07T20:32:15.5007069Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:15.5007274Z 
2025-05-07T20:32:15.5007385Z moe/activation_test.py:126: 
2025-05-07T20:32:15.5007694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:15.5008049Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:15.5008396Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:15.5009222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:15.5010003Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:15.5010666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:15.5011385Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:15.5012109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:15.5012867Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:15.5013656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:15.5014442Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:15.5015231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:15.5015893Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:15.5016524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:15.5017064Z     fn()
2025-05-07T20:32:15.5017584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:15.5018272Z     self.fn.run(
2025-05-07T20:32:15.5018760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:15.5019313Z     kernel = self.compile(
2025-05-07T20:32:15.5019870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:15.5020551Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:15.5020963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:15.5021201Z 
2025-05-07T20:32:15.5021414Z self = 
2025-05-07T20:32:15.5022540Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:15.5024361Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a5ea7370>}
2025-05-07T20:32:15.5026130Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:15.5027412Z context = 
2025-05-07T20:32:15.5027756Z 
2025-05-07T20:32:15.5027944Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:15.5028574Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:15.5029134Z             module_map=module_map)
2025-05-07T20:32:15.5029556Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:15.5029964Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:15.5030276Z E       ^
2025-05-07T20:32:15.5030834Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.5031393Z 
2025-05-07T20:32:15.5031904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:15.5032548Z 
2025-05-07T20:32:15.5032660Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:15.5033142Z     self=,
2025-05-07T20:32:15.5033666Z     T=128,
2025-05-07T20:32:15.5033863Z     D=7168,
2025-05-07T20:32:15.5034073Z     scale_ub=None,
2025-05-07T20:32:15.5034310Z     contiguous=False,
2025-05-07T20:32:15.5034722Z     compiled=False,
2025-05-07T20:32:15.5034946Z )
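With compiled=True the draw gets further: fn() returns, and the failure moves into the eager reference path, where triton_quantize_fp8_row launches _kernel_quantize_fp8_row and trips over the same fp8e4nv cast during autotuning. The reference math itself is plain rowwise scaling; a pure-PyTorch stand-in (an illustrative assumption, not the fbgemm_gpu implementation) would look like:

    from typing import Optional, Tuple

    import torch


    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row: map the row's max |value| onto the e4m3
        # maximum (448.0), optionally capping the row max at scale_ub.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None], as the test does, then recovers the silu(x0) * x1 product up to fp8 rounding.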
2025-05-07T20:32:15.7216632Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:15.7217004Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:15.7217276Z E       ^
2025-05-07T20:32:15.7217760Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.7218232Z 
2025-05-07T20:32:15.7218660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:15.7219193Z 
2025-05-07T20:32:15.7219316Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:15.7219777Z     self=,
2025-05-07T20:32:15.7220193Z     T=4096,
2025-05-07T20:32:15.7220395Z     D=5120,
2025-05-07T20:32:15.7220610Z     scale_ub=1200.0,
2025-05-07T20:32:15.7220843Z     contiguous=True,
2025-05-07T20:32:15.7221080Z     compiled=False,
2025-05-07T20:32:15.7221299Z )
2025-05-07T20:32:15.7258199Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:15.7258570Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:15.7258849Z E       ^
2025-05-07T20:32:15.7259427Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.7259900Z 
2025-05-07T20:32:15.7260332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:15.7260874Z 
2025-05-07T20:32:15.7260985Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:15.7261426Z     self=,
2025-05-07T20:32:15.7261848Z     T=1,
2025-05-07T20:32:15.7262044Z     D=5120,
2025-05-07T20:32:15.7262257Z     scale_ub=None,
2025-05-07T20:32:15.7262490Z     contiguous=True,
2025-05-07T20:32:15.7262724Z     compiled=True,
2025-05-07T20:32:15.7262942Z )
2025-05-07T20:32:16.2129931Z W0507 20:32:16.209000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
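The [1/3] through [1/5] prefixes are torch.compile recompile counters for this frame: each new (T, D) shape retraces fn(), and every retrace re-runs identify_mutated_tensors, which lowers the user-defined Triton kernel to TTIR just to discover which arguments it writes. When that lowering raises (here, on the fp8e4nv cast), Dynamo logs the warning above and conservatively marks every input as mutated. The kernel being analyzed follows the silu-mul pattern of the test; an illustrative Triton kernel of that shape (not the FBGEMM source), with the offending fp8 cast left out:

    import triton
    import triton.language as tl


    @triton.jit
    def silu_mul_kernel(x0_ptr, x1_ptr, y_ptr, n, BLOCK: tl.constexpr):
        # Computes silu(x0) * x1 in fp32. The real kernel additionally casts
        # the result to tl.float8e4nv, which is the cast SM 8.6 rejects.
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x0 = tl.load(x0_ptr + offs, mask=mask).to(tl.float32)
        x1 = tl.load(x1_ptr + offs, mask=mask).to(tl.float32)
        y = x0 * tl.sigmoid(x0) * x1
        tl.store(y_ptr + offs, y, mask=mask)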
2025-05-07T20:32:16.3823735Z W0507 20:32:16.379000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:16.8483523Z W0507 20:32:16.845000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:16.8786497Z W0507 20:32:16.875000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
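Because @settings(verbosity=Verbosity.verbose) logs every draw and nothing fails fast, hypothesis keeps sampling fresh (T, D, scale_ub, contiguous, compiled) tuples that all hit the identical CompilationError, up to _MAX_SAMPLES. To replay one specific draw without waiting on the sampler, an @example pin runs first; a generic, self-contained sketch (not a patch to moe/activation_test.py):

    from hypothesis import example, given, settings
    from hypothesis import strategies as st


    @settings(max_examples=5, deadline=None)
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @example(T=1)  # replayed before any random draws
    def test_replay_one_case(T: int) -> None:
        assert T >= 1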
2025-05-07T20:32:17.1991395Z 2025-05-07T20:32:17.1991651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.1991994Z op = silu_mul_quant 2025-05-07T20:32:17.1992271Z if compiled: 2025-05-07T20:32:17.1992545Z op = torch.compile(op) 2025-05-07T20:32:17.1992863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.1993173Z 2025-05-07T20:32:17.1993388Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.1993776Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.1994099Z 2025-05-07T20:32:17.1994365Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.1994722Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.1995080Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.1995437Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.1995819Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.1996160Z 2025-05-07T20:32:17.1996386Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.1996597Z 2025-05-07T20:32:17.1996715Z moe/activation_test.py:126: 2025-05-07T20:32:17.1997185Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.1997555Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.1997914Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.1998765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.1999575Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.2000168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.2000911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.2001648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.2002435Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.2003251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.2004058Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.2004920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.2005610Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.2006257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.2006812Z fn() 2025-05-07T20:32:17.2007362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.2007988Z self.fn.run( 2025-05-07T20:32:17.2008494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.2009068Z kernel = self.compile( 2025-05-07T20:32:17.2009653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.2010364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.2010788Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.2011043Z 2025-05-07T20:32:17.2011264Z self = 2025-05-07T20:32:17.2012422Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.2013891Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767f26d480>} 2025-05-07T20:32:17.2015382Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.2016477Z context = 2025-05-07T20:32:17.2016794Z 2025-05-07T20:32:17.2016975Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.2017535Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.2018043Z module_map=module_map) 2025-05-07T20:32:17.2018434Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.2018821Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.2019115Z E ^ 2025-05-07T20:32:17.2019611Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.2020187Z 2025-05-07T20:32:17.2020635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.2021187Z 2025-05-07T20:32:17.2021306Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.2021756Z self=, 2025-05-07T20:32:17.2022195Z T=2048, 2025-05-07T20:32:17.2022410Z D=5120, 2025-05-07T20:32:17.2022631Z scale_ub=None, 2025-05-07T20:32:17.2022874Z contiguous=True, 2025-05-07T20:32:17.2023133Z compiled=True, 2025-05-07T20:32:17.2023372Z ) 2025-05-07T20:32:17.6695853Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:17.6696988Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:17.6698416Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:17.6700099Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:17.6701541Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:17.6702988Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.6704352Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:17.6705788Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.6707259Z W0507 20:32:17.666000 86874 
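[editor note] Every failure in this run has the same root cause: Triton's fp8e4nv type (PyTorch's torch.float8_e4m3fn) needs hardware fp8 support, which NVIDIA GPUs only provide from compute capability 8.9 (Ada/Hopper) onward; the GPU on this runner evidently predates that, so only fp8e4b15 and fp8e5 are available, exactly as the ValueError states. A minimal sketch of a capability guard, assuming pytest and a hypothetical supports_fp8e4nv helper not present in the test file, would let the suite skip rather than error on such machines:

import pytest
import torch

def supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (torch.float8_e4m3fn) appears to
    # require an NVIDIA GPU of compute capability >= 8.9 (Ada/Hopper); older
    # parts only expose fp8e4b15 and fp8e5, matching the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, this would skip instead of erroring:
requires_fp8e4nv = pytest.mark.skipif(
    not supports_fp8e4nv(), reason="GPU lacks fp8e4nv (compute capability < 8.9)"
)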
2025-05-07T20:32:17.2021187Z
2025-05-07T20:32:17.2021306Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:17.2021756Z     self=,
2025-05-07T20:32:17.2022195Z     T=2048,
2025-05-07T20:32:17.2022410Z     D=5120,
2025-05-07T20:32:17.2022631Z     scale_ub=None,
2025-05-07T20:32:17.2022874Z     contiguous=True,
2025-05-07T20:32:17.2023133Z     compiled=True,
2025-05-07T20:32:17.2023372Z )
2025-05-07T20:32:17.6695853Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:17.6696988Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last):
2025-05-07T20:32:17.6698416Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:17.6700099Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:17.6701541Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:17.6702988Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.6704352Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:17.6705788Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:17.6707259Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:17.6708552Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     generator.visit(fn.parse())
2025-05-07T20:32:17.6709822Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:17.6711075Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ret = super().visit(node)
2025-05-07T20:32:17.6712159Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
2025-05-07T20:32:17.6713220Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     return visitor(node)
2025-05-07T20:32:17.6714536Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:17.6716003Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:17.6717157Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
2025-05-07T20:32:17.6718237Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     self.visit(item)
2025-05-07T20:32:17.6719445Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:17.6720850Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:17.6721952Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.6722897Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.6723897Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^
2025-05-07T20:32:17.6724954Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[... the identical [1/5] identify_mutated_tensors warning and CompilationError traceback is emitted three more times, at 20:32:17.834, 20:32:18.323, and 20:32:18.356; only the timestamps differ ...]
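[editor note] The stack trace pinpoints the failing call: triton_quantize_fp8_row in fbgemm_gpu's fp8_gemm.py launches _kernel_quantize_fp8_row, whose compilation dies at the kernel signature. The failure should therefore reproduce without hypothesis or torch.compile by calling the quantizer directly; a sketch, assuming only the import path and two-argument call shown in the trace above:

import torch
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

# A row-major float32 activation in the same shape family the test draws;
# the values are irrelevant, since compilation fails before the kernel runs.
y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)

# On a GPU without fp8e4nv support, this is expected to raise
# triton.compiler.errors.CompilationError wrapping the ValueError above.
y_fp8, y_scale = triton_quantize_fp8_row(y, None)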
2025-05-07T20:32:18.8471590Z self =
2025-05-07T20:32:18.8472183Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source, Triton stack trace, and locals identical to the T = 1 failure above ...]
2025-05-07T20:32:18.8512216Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.8512621Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:18.8512919Z E       ^
2025-05-07T20:32:18.8513444Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.8514596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:18.8515299Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.8515811Z     self=,
2025-05-07T20:32:18.8516268Z     T=128,
2025-05-07T20:32:18.8516488Z     D=5120,
2025-05-07T20:32:18.8516707Z     scale_ub=None,
2025-05-07T20:32:18.8516954Z     contiguous=True,
2025-05-07T20:32:18.8517214Z     compiled=True,
2025-05-07T20:32:18.8517446Z )
[... the [1/6] identify_mutated_tensors warning and CompilationError traceback is emitted four times for this example, at 20:32:19.375, 20:32:19.561, 20:32:20.071, and 20:32:20.104; content identical to the [1/5] traceback above ...]
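[editor note] Skipping the test handles CI, but an alternative is a kernel-side fallback that picks a supported fp8 format at runtime. A minimal sketch of that dispatch, not FBGEMM's actual code; the sm_89 cutoff for e4m3 is our assumption from the error message and NVIDIA's fp8 hardware generations:

import torch

def pick_fp8_dtype() -> torch.dtype:
    # Assumption: e4m3 (Triton "fp8e4nv") needs compute capability >= 8.9,
    # while e5m2 (Triton "fp8e5") is one of the formats the error message
    # reports as still available on this runner's GPU.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2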
20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:20.1107633Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:20.1108952Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:20.1110474Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:20.1111662Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.1112676Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:20.1113707Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:20.1114853Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.5423751Z self = 2025-05-07T20:32:20.5425461Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:20.5426065Z 2025-05-07T20:32:20.5426185Z @given( 2025-05-07T20:32:20.5426482Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:20.5426836Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:20.5427189Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:20.5427571Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:20.5427945Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:20.5428280Z ) 2025-05-07T20:32:20.5428701Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:20.5429202Z def test_silu_mul_quant( 2025-05-07T20:32:20.5429490Z self, 2025-05-07T20:32:20.5429723Z T: int, 2025-05-07T20:32:20.5430575Z D: int, 2025-05-07T20:32:20.5430831Z scale_ub: Optional[float], 2025-05-07T20:32:20.5431155Z contiguous: bool, 2025-05-07T20:32:20.5431438Z compiled: bool, 2025-05-07T20:32:20.5431697Z ) -> None: 2025-05-07T20:32:20.5431948Z torch.manual_seed(2025) 2025-05-07T20:32:20.5432239Z 2025-05-07T20:32:20.5432548Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:20.5432942Z 2025-05-07T20:32:20.5433168Z x_sign = torch.sign(x) 2025-05-07T20:32:20.5433580Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:20.5433937Z x = x_sign * x_clamp 2025-05-07T20:32:20.5434219Z x0 = x[:, :D] 2025-05-07T20:32:20.5434463Z x1 = x[:, D:] 2025-05-07T20:32:20.5434732Z 2025-05-07T20:32:20.5434949Z if contiguous: 2025-05-07T20:32:20.5435220Z x0 = x0.contiguous() 2025-05-07T20:32:20.5435509Z x1 = x1.contiguous() 2025-05-07T20:32:20.5435794Z 2025-05-07T20:32:20.5436019Z if scale_ub is not None: 2025-05-07T20:32:20.5436327Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:20.5436712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:20.5437066Z ) 2025-05-07T20:32:20.5437283Z else: 2025-05-07T20:32:20.5437527Z scale_ub_tensor = None 
2025-05-07T20:32:20.5437819Z 2025-05-07T20:32:20.5438088Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:20.5438442Z op = silu_mul_quant 2025-05-07T20:32:20.5438730Z if compiled: 2025-05-07T20:32:20.5439018Z op = torch.compile(op) 2025-05-07T20:32:20.5439352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:20.5439672Z 2025-05-07T20:32:20.5439902Z y_fp8, y_scale = fn() 2025-05-07T20:32:20.5440225Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:20.5440560Z 2025-05-07T20:32:20.5440834Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:20.5441217Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:20.5441556Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:20.5441918Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:20.5442323Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:20.5442680Z 2025-05-07T20:32:20.5442917Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:20.5443139Z 2025-05-07T20:32:20.5443264Z moe/activation_test.py:126: 2025-05-07T20:32:20.5443603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.5443992Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:20.5444518Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:20.5445410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:20.5446270Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:20.5446894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:20.5447676Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:20.5448452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:20.5449277Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:20.5450133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:20.5450994Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:20.5451817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:20.5452549Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:20.5453322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:20.5453910Z fn() 2025-05-07T20:32:20.5454493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:20.5455151Z self.fn.run( 2025-05-07T20:32:20.5455686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:20.5456285Z kernel = self.compile( 2025-05-07T20:32:20.5456901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:20.5457651Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.5458100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.5458363Z 2025-05-07T20:32:20.5458605Z self = 2025-05-07T20:32:20.5459829Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:20.5461378Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767ca01360>} 2025-05-07T20:32:20.5462900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:20.5464048Z context = 2025-05-07T20:32:20.5464379Z 2025-05-07T20:32:20.5464570Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:20.5465170Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.5465716Z module_map=module_map) 2025-05-07T20:32:20.5466134Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.5466551Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:20.5466862Z E ^ 2025-05-07T20:32:20.5467392Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.5467917Z 2025-05-07T20:32:20.5468395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.5468979Z 2025-05-07T20:32:20.5469189Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:20.5469667Z self=, 2025-05-07T20:32:20.5470117Z T=4096, 2025-05-07T20:32:20.5470334Z D=5120, 2025-05-07T20:32:20.5470564Z scale_ub=None, 2025-05-07T20:32:20.5470807Z contiguous=True, 2025-05-07T20:32:20.5471065Z compiled=True, 2025-05-07T20:32:20.5471303Z ) 2025-05-07T20:32:21.0805703Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:21.0806916Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:21.0808433Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:21.0810031Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:21.0811759Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:21.0813307Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.0814765Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:21.0816312Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.0817901Z W0507 20:32:21.076000 86874 
2025-05-07T20:32:20.5468979Z 
2025-05-07T20:32:20.5469189Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.5469667Z     self=,
2025-05-07T20:32:20.5470117Z     T=4096,
2025-05-07T20:32:20.5470334Z     D=5120,
2025-05-07T20:32:20.5470564Z     scale_ub=None,
2025-05-07T20:32:20.5470807Z     contiguous=True,
2025-05-07T20:32:20.5471065Z     compiled=True,
2025-05-07T20:32:20.5471303Z )
2025-05-07T20:32:21.0805703Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:21.0806916Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last):
2025-05-07T20:32:21.0808433Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:21.0810031Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:21.0811759Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:21.0813307Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:21.0814765Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:21.0816312Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:21.0817901Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:21.0819306Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     generator.visit(fn.parse())
2025-05-07T20:32:21.0820669Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:21.0822033Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     ret = super().visit(node)
2025-05-07T20:32:21.0823200Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
2025-05-07T20:32:21.0824687Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     return visitor(node)
2025-05-07T20:32:21.0826048Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:21.0827531Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:21.0828921Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
2025-05-07T20:32:21.0830087Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     self.visit(item)
2025-05-07T20:32:21.0831398Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:21.0832906Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:21.0834202Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.0835221Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.0836055Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^
2025-05-07T20:32:21.0837238Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.2388449Z self = 
2025-05-07T20:32:22.2389035Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:22.2389331Z 
2025-05-07T20:32:22.2389417Z @given(
2025-05-07T20:32:22.2389671Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:22.2390006Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:22.2390339Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:22.2390702Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:22.2391056Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:22.2391373Z )
2025-05-07T20:32:22.2391756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:22.2392231Z def test_silu_mul_quant(
2025-05-07T20:32:22.2392496Z     self,
2025-05-07T20:32:22.2392712Z     T: int,
2025-05-07T20:32:22.2392920Z     D: int,
2025-05-07T20:32:22.2393160Z     scale_ub: Optional[float],
2025-05-07T20:32:22.2393724Z     contiguous: bool,
2025-05-07T20:32:22.2393987Z     compiled: bool,
2025-05-07T20:32:22.2394225Z ) -> None:
2025-05-07T20:32:22.2394461Z     torch.manual_seed(2025)
2025-05-07T20:32:22.2394724Z 
2025-05-07T20:32:22.2395014Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:22.2395382Z 
2025-05-07T20:32:22.2395594Z     x_sign = torch.sign(x)
2025-05-07T20:32:22.2395904Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:22.2396240Z     x = x_sign * x_clamp
2025-05-07T20:32:22.2396502Z     x0 = x[:, :D]
2025-05-07T20:32:22.2396733Z     x1 = x[:, D:]
2025-05-07T20:32:22.2396959Z 
2025-05-07T20:32:22.2397160Z     if contiguous:
2025-05-07T20:32:22.2397411Z         x0 = x0.contiguous()
2025-05-07T20:32:22.2397690Z         x1 = x1.contiguous()
2025-05-07T20:32:22.2397950Z 
2025-05-07T20:32:22.2398153Z     if scale_ub is not None:
2025-05-07T20:32:22.2398449Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:22.2398822Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:22.2399156Z         )
2025-05-07T20:32:22.2399361Z     else:
2025-05-07T20:32:22.2399589Z         scale_ub_tensor = None
2025-05-07T20:32:22.2399863Z 
2025-05-07T20:32:22.2400110Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:22.2400448Z         op = silu_mul_quant
2025-05-07T20:32:22.2400720Z         if compiled:
2025-05-07T20:32:22.2400987Z             op = torch.compile(op)
2025-05-07T20:32:22.2401309Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:22.2401606Z 
2025-05-07T20:32:22.2401813Z     y_fp8, y_scale = fn()
2025-05-07T20:32:22.2402125Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:22.2402438Z 
2025-05-07T20:32:22.2402687Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:22.2403045Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:22.2403366Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:22.2403704Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:22.2404084Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:22.2404421Z 
2025-05-07T20:32:22.2404642Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:22.2404850Z 
2025-05-07T20:32:22.2404957Z moe/activation_test.py:126: 
2025-05-07T20:32:22.2405282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:22.2405647Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:22.2405996Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:22.2406963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:22.2407766Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:22.2408353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:22.2409078Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:22.2409813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:22.2410586Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:22.2411390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:22.2412179Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:22.2412961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:22.2413645Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:22.2414282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:22.2414921Z     fn()
2025-05-07T20:32:22.2415465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:22.2416090Z     self.fn.run(
2025-05-07T20:32:22.2416586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:22.2417156Z     kernel = self.compile(
2025-05-07T20:32:22.2417734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:22.2418426Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:22.2418856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:22.2419103Z 
2025-05-07T20:32:22.2419324Z self = 
2025-05-07T20:32:22.2420467Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:22.2421943Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767c4130a0>}
2025-05-07T20:32:22.2423353Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:22.2424895Z context = 
2025-05-07T20:32:22.2425215Z 
2025-05-07T20:32:22.2425395Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:22.2425956Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:22.2426467Z                            module_map=module_map)
2025-05-07T20:32:22.2426903Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.2427293Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:22.2427574Z E       ^
2025-05-07T20:32:22.2428067Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.2428546Z 
2025-05-07T20:32:22.2428986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:22.2429523Z 
2025-05-07T20:32:22.2429642Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:22.2430220Z     self=,
2025-05-07T20:32:22.2430658Z     T=16384,
2025-05-07T20:32:22.2430873Z     D=5120,
2025-05-07T20:32:22.2431088Z     scale_ub=None,
2025-05-07T20:32:22.2431319Z     contiguous=True,
2025-05-07T20:32:22.2431564Z     compiled=True,
2025-05-07T20:32:22.2431785Z )
2025-05-07T20:32:22.2861195Z W0507 20:32:22.284000 86874 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:22.2862518Z W0507 20:32:22.284000 86874 site-packages/torch/_dynamo/convert_frame.py:987] [1/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:22.2863935Z W0507 20:32:22.284000 86874 site-packages/torch/_dynamo/convert_frame.py:987] [1/8]    last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:22.2865001Z W0507 20:32:22.284000 86874 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:22.2866178Z W0507 20:32:22.284000 86874 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
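The stride mismatch follows from the test itself: x0 = x[:, :D] views a [T, 2*D] buffer (row stride 10240 for D=5120), while contiguous=True examples pass a compacted copy (row stride 5120), so each layout change trips a guard and forces a recompile of silu_mul_quant until the budget of 8 is spent and Dynamo falls back to eager. Two hedged workarounds, assuming the config knob keeps the name printed in the warning (older PyTorch releases call it cache_size_limit):

import torch
import torch._dynamo

# Option 1: raise the per-function recompile budget named in the warning.
torch._dynamo.config.recompile_limit = 32

# Option 2: compile once with dynamic shapes so the stride/shape variants
# drawn by Hypothesis can share a single graph.
# op = torch.compile(silu_mul_quant, dynamic=True)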
2025-05-07T20:32:22.3948967Z self = 
2025-05-07T20:32:22.3949979Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:22.3950514Z 
2025-05-07T20:32:22.3968831Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:22.3969164Z moe/activation_test.py:126: 
2025-05-07T20:32:22.3991339Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.3991734Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:22.3992036Z E       ^
2025-05-07T20:32:22.3992544Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.3993559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:22.3994120Z 
2025-05-07T20:32:22.3994249Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:22.3994707Z     self=,
2025-05-07T20:32:22.3995147Z     T=1,
2025-05-07T20:32:22.3995357Z     D=5120,
2025-05-07T20:32:22.3995709Z     scale_ub=1200.0,
2025-05-07T20:32:22.3995959Z     contiguous=True,
2025-05-07T20:32:22.3996211Z     compiled=True,
2025-05-07T20:32:22.3996445Z )
2025-05-07T20:32:22.7581113Z self = 
2025-05-07T20:32:22.7582192Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:22.7582751Z 
2025-05-07T20:32:22.7597137Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:22.7597427Z moe/activation_test.py:117: 
2025-05-07T20:32:22.7612901Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.7613276Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.7613558Z E       ^
2025-05-07T20:32:22.7614048Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.7614969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
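Both the fused kernel (_fbgemm_silu_mul_quant) and the reference path (_kernel_quantize_fp8_row) fail the same way, so every drawn example dies before the comparison runs. For context, a plain-PyTorch sketch of what the fn/ref_fn pair computes: SiLU(x0) * x1 followed by row-wise FP8 quantization. The scale_ub clamp and the returned inverse scale are assumptions modeled on the test's dequantization step (y_fp8.to(torch.float32) * y_scale[:, None]), not taken from the fbgemm_gpu source:

from typing import Optional, Tuple
import torch

def silu_mul_quant_sketch(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    fp8_dtype: torch.dtype = torch.float8_e4m3fn,
) -> Tuple[torch.Tensor, torch.Tensor]:
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1
    amax = y.abs().amax(dim=1).clamp(min=1e-12)  # per-row max magnitude
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub)  # assumed clamp semantics
    scale = torch.finfo(fp8_dtype).max / amax  # per-row quantization scale
    y_fp8 = (y * scale[:, None]).to(fp8_dtype)
    return y_fp8, scale.reciprocal()  # inverse scale, used for dequantization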
2025-05-07T20:32:22.7600286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.7601004Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.7601580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.7602304Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.7602998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.7603692Z kernel = self.compile( 2025-05-07T20:32:22.7604266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.7604957Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.7605370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.7605617Z 2025-05-07T20:32:22.7605832Z self = 2025-05-07T20:32:22.7607019Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.7608481Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767cc3d480>} 2025-05-07T20:32:22.7609890Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.7610966Z context = 2025-05-07T20:32:22.7611278Z 2025-05-07T20:32:22.7611456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.7612012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.7612505Z module_map=module_map) 2025-05-07T20:32:22.7612901Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.7613276Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.7613558Z E ^ 2025-05-07T20:32:22.7614048Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.7614533Z 2025-05-07T20:32:22.7614969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.7615505Z 2025-05-07T20:32:22.7615625Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.7616062Z self=, 2025-05-07T20:32:22.7616496Z T=1, 2025-05-07T20:32:22.7616701Z D=5120, 2025-05-07T20:32:22.7616913Z scale_ub=None, 2025-05-07T20:32:22.7617142Z contiguous=False, 2025-05-07T20:32:22.7617386Z compiled=True, 2025-05-07T20:32:22.7617611Z ) 2025-05-07T20:32:22.8318861Z self = 2025-05-07T20:32:22.8319427Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:22.8319728Z 2025-05-07T20:32:22.8319815Z @given( 2025-05-07T20:32:22.8327556Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.8328025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.8328360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.8328716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.8329061Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.8329369Z ) 2025-05-07T20:32:22.8329745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.8330212Z def test_silu_mul_quant( 2025-05-07T20:32:22.8330471Z self, 2025-05-07T20:32:22.8330682Z T: int, 2025-05-07T20:32:22.8330890Z D: int, 2025-05-07T20:32:22.8331121Z scale_ub: Optional[float], 2025-05-07T20:32:22.8331422Z contiguous: bool, 2025-05-07T20:32:22.8331676Z compiled: bool, 2025-05-07T20:32:22.8331920Z ) -> None: 2025-05-07T20:32:22.8332150Z torch.manual_seed(2025) 2025-05-07T20:32:22.8332407Z 2025-05-07T20:32:22.8332867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.8333231Z 2025-05-07T20:32:22.8333441Z x_sign = torch.sign(x) 2025-05-07T20:32:22.8333747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.8334075Z x = x_sign * x_clamp 2025-05-07T20:32:22.8334332Z x0 = x[:, :D] 2025-05-07T20:32:22.8334558Z x1 = x[:, D:] 2025-05-07T20:32:22.8334780Z 2025-05-07T20:32:22.8334984Z if contiguous: 2025-05-07T20:32:22.8335222Z x0 = x0.contiguous() 2025-05-07T20:32:22.8335498Z x1 = x1.contiguous() 2025-05-07T20:32:22.8335751Z 2025-05-07T20:32:22.8335952Z if scale_ub is not None: 2025-05-07T20:32:22.8336245Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.8336597Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.8336917Z ) 2025-05-07T20:32:22.8337124Z else: 2025-05-07T20:32:22.8337346Z scale_ub_tensor = None 2025-05-07T20:32:22.8337625Z 2025-05-07T20:32:22.8337869Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.8338205Z op = silu_mul_quant 2025-05-07T20:32:22.8338471Z if compiled: 2025-05-07T20:32:22.8338733Z op = torch.compile(op) 2025-05-07T20:32:22.8339050Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.8339339Z 2025-05-07T20:32:22.8339546Z y_fp8, y_scale = fn() 2025-05-07T20:32:22.8339847Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:22.8340144Z 2025-05-07T20:32:22.8340394Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.8340743Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:22.8341049Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:22.8341379Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:22.8341756Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.8342103Z 2025-05-07T20:32:22.8342312Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:22.8342526Z 2025-05-07T20:32:22.8342634Z moe/activation_test.py:126: 2025-05-07T20:32:22.8342950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.8343300Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:22.8343645Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.8344468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:22.8345256Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:22.8345948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.8346696Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.8347443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:22.8348205Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.8348994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:22.8349783Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.8350547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:22.8351215Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:22.8351850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:22.8352392Z fn() 2025-05-07T20:32:22.8352925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.8353707Z self.fn.run( 2025-05-07T20:32:22.8354215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.8354771Z kernel = self.compile( 2025-05-07T20:32:22.8355332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.8356016Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.8356439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.8356681Z 2025-05-07T20:32:22.8356935Z self = 2025-05-07T20:32:22.8358079Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.8359525Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f7677ea89d0>} 2025-05-07T20:32:22.8360924Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.8361993Z context = 2025-05-07T20:32:22.8362293Z 2025-05-07T20:32:22.8362473Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.8363017Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.8363513Z module_map=module_map) 2025-05-07T20:32:22.8363898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.8364268Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:22.8364556Z E ^ 2025-05-07T20:32:22.8365042Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.8365518Z 2025-05-07T20:32:22.8365957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.8366491Z 2025-05-07T20:32:22.8366601Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.8367035Z self=, 2025-05-07T20:32:22.8367454Z T=1, 2025-05-07T20:32:22.8367646Z D=5120, 2025-05-07T20:32:22.8367852Z scale_ub=None, 2025-05-07T20:32:22.8368076Z contiguous=True, 2025-05-07T20:32:22.8368312Z compiled=False, 2025-05-07T20:32:22.8368619Z ) 2025-05-07T20:32:23.0055401Z self = 2025-05-07T20:32:23.0056009Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:23.0056397Z 2025-05-07T20:32:23.0056548Z @given( 2025-05-07T20:32:23.0057094Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.0057720Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.0058305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.0058922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.0059540Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.0060088Z ) 2025-05-07T20:32:23.0060745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.0061583Z def test_silu_mul_quant( 2025-05-07T20:32:23.0062047Z self, 2025-05-07T20:32:23.0062418Z T: int, 2025-05-07T20:32:23.0062797Z D: int, 2025-05-07T20:32:23.0063216Z scale_ub: Optional[float], 2025-05-07T20:32:23.0063732Z contiguous: bool, 2025-05-07T20:32:23.0064182Z compiled: bool, 2025-05-07T20:32:23.0064610Z ) -> None: 2025-05-07T20:32:23.0065325Z torch.manual_seed(2025) 2025-05-07T20:32:23.0065783Z 2025-05-07T20:32:23.0066305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.0066924Z 2025-05-07T20:32:23.0067146Z x_sign = torch.sign(x) 2025-05-07T20:32:23.0067454Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.0067778Z x = x_sign * x_clamp 2025-05-07T20:32:23.0068028Z x0 = x[:, :D] 2025-05-07T20:32:23.0068255Z x1 = x[:, D:] 2025-05-07T20:32:23.0068475Z 2025-05-07T20:32:23.0068668Z if contiguous: 2025-05-07T20:32:23.0068916Z x0 = x0.contiguous() 2025-05-07T20:32:23.0069189Z x1 = x1.contiguous() 2025-05-07T20:32:23.0069443Z 2025-05-07T20:32:23.0069646Z if scale_ub is not None: 2025-05-07T20:32:23.0069936Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.0070289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.0070616Z ) 2025-05-07T20:32:23.0070823Z else: 2025-05-07T20:32:23.0071052Z scale_ub_tensor = None 2025-05-07T20:32:23.0071314Z 2025-05-07T20:32:23.0071558Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.0071889Z op = silu_mul_quant 2025-05-07T20:32:23.0072151Z if compiled: 2025-05-07T20:32:23.0072423Z 
op = torch.compile(op) 2025-05-07T20:32:23.0072738Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0073030Z 2025-05-07T20:32:23.0073236Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.0073410Z 2025-05-07T20:32:23.0073608Z moe/activation_test.py:117: 2025-05-07T20:32:23.0073925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0074272Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.0074571Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0075297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.0076018Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.0076584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.0077302Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.0078000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.0078553Z kernel = self.compile( 2025-05-07T20:32:23.0079120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.0079931Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.0080344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0080586Z 2025-05-07T20:32:23.0080801Z self = 2025-05-07T20:32:23.0081929Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.0083355Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767cc3d1b0>} 2025-05-07T20:32:23.0084752Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.0085818Z context = 2025-05-07T20:32:23.0086125Z 2025-05-07T20:32:23.0086304Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.0086930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.0087420Z module_map=module_map) 2025-05-07T20:32:23.0087800Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.0088172Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.0088444Z E ^ 2025-05-07T20:32:23.0088924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.0089399Z 2025-05-07T20:32:23.0089830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.0090368Z 2025-05-07T20:32:23.0090482Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.0090927Z self=, 2025-05-07T20:32:23.0091343Z T=128, 2025-05-07T20:32:23.0091541Z D=5120, 2025-05-07T20:32:23.0091746Z scale_ub=None, 2025-05-07T20:32:23.0091979Z contiguous=False, 2025-05-07T20:32:23.0092220Z compiled=True, 2025-05-07T20:32:23.0092435Z ) 2025-05-07T20:32:23.0092766Z self = 2025-05-07T20:32:23.0093282Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:23.0093563Z 2025-05-07T20:32:23.0093648Z @given( 2025-05-07T20:32:23.0093887Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.0094215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.0094540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.0094891Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.0095248Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.0095563Z ) 2025-05-07T20:32:23.0095933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.0096394Z def test_silu_mul_quant( 2025-05-07T20:32:23.0096686Z self, 2025-05-07T20:32:23.0096901Z T: int, 2025-05-07T20:32:23.0097116Z D: int, 2025-05-07T20:32:23.0097344Z scale_ub: Optional[float], 2025-05-07T20:32:23.0097639Z contiguous: bool, 2025-05-07T20:32:23.0097898Z compiled: bool, 2025-05-07T20:32:23.0098131Z ) -> None: 2025-05-07T20:32:23.0098373Z torch.manual_seed(2025) 2025-05-07T20:32:23.0098635Z 2025-05-07T20:32:23.0098919Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.0099280Z 2025-05-07T20:32:23.0099490Z x_sign = torch.sign(x) 2025-05-07T20:32:23.0099792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.0100120Z x = x_sign * x_clamp 2025-05-07T20:32:23.0100489Z x0 = x[:, :D] 2025-05-07T20:32:23.0100725Z x1 = x[:, D:] 2025-05-07T20:32:23.0100942Z 2025-05-07T20:32:23.0101142Z if contiguous: 2025-05-07T20:32:23.0101388Z x0 = x0.contiguous() 2025-05-07T20:32:23.0101668Z x1 = x1.contiguous() 2025-05-07T20:32:23.0101922Z 2025-05-07T20:32:23.0102128Z if scale_ub is not None: 2025-05-07T20:32:23.0102413Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.0102767Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.0103098Z ) 2025-05-07T20:32:23.0103300Z else: 2025-05-07T20:32:23.0103529Z scale_ub_tensor = None 2025-05-07T20:32:23.0103798Z 2025-05-07T20:32:23.0104039Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.0104373Z op = silu_mul_quant 2025-05-07T20:32:23.0104639Z if compiled: 2025-05-07T20:32:23.0104898Z op = torch.compile(op) 2025-05-07T20:32:23.0105223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0105515Z 2025-05-07T20:32:23.0105721Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.0105896Z 2025-05-07T20:32:23.0106001Z moe/activation_test.py:117: 2025-05-07T20:32:23.0106406Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0106754Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.0107051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0107644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.0108234Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.0108920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.0109642Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.0110211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.0110930Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.0111618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.0112186Z kernel = self.compile( 2025-05-07T20:32:23.0112760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.0113452Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.0113926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0114174Z 2025-05-07T20:32:23.0114391Z self = 2025-05-07T20:32:23.0115520Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.0116996Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7677e27910>} 2025-05-07T20:32:23.0118399Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.0119470Z context = 2025-05-07T20:32:23.0119777Z 2025-05-07T20:32:23.0119957Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.0120510Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.0121003Z module_map=module_map) 2025-05-07T20:32:23.0121482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.0121859Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.0122130Z E ^ 2025-05-07T20:32:23.0122621Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.0123101Z
2025-05-07T20:32:23.0123534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:23.0124396Z
2025-05-07T20:32:23.0124518Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:23.1446774Z E   triton.compiler.errors.CompilationError: _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.1448348Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:23.1486492Z E   triton.compiler.errors.CompilationError: _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.1488050Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:23.3497063Z E   triton.compiler.errors.CompilationError: _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.3498643Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:23.3531410Z E   triton.compiler.errors.CompilationError: _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.3532978Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:23.6996560Z E   triton.compiler.errors.CompilationError: _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
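For scale: the @given strategies draw from a fixed grid of 5 x 2 x 2 x 2 x 2 = 80 parameter combinations, and every sampled combination fails the same way before any tensor math runs. The grid can be enumerated outside Hypothesis in a few lines (a standalone sketch, not part of activation_test.py):

    import itertools

    # The same grid test_silu_mul_quant's @given strategies sample from.
    combos = list(itertools.product(
        [1, 128, 2048, 4096, 16384],  # T
        [5120, 7168],                 # D
        [None, 1200.00],              # scale_ub
        [True, False],                # contiguous
        [True, False],                # compiled
    ))
    print(len(combos))  # 80 combinations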
2025-05-07T20:32:23.6998174Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:23.7985592Z self =
2025-05-07T20:32:23.7986389Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:23.7999215Z     y_fp8, y_scale = fn()
2025-05-07T20:32:23.7999513Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:23.7999821Z
2025-05-07T20:32:23.8000075Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:23.8000578Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:23.8000887Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:23.8001222Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:23.8001601Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:23.8001928Z
2025-05-07T20:32:23.8002150Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:23.8002355Z
2025-05-07T20:32:23.8002468Z moe/activation_test.py:126:
2025-05-07T20:32:23.8002775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:23.8003131Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:23.8003478Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:23.8004302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:23.8005080Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:23.8005658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:23.8006372Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:23.8007173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:23.8007942Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:23.8008726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:23.8009511Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:23.8010273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:23.8010944Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:23.8011575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:23.8018847Z     fn()
2025-05-07T20:32:23.8019411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:23.8020036Z     self.fn.run(
2025-05-07T20:32:23.8020531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:23.8021092Z     kernel = self.compile(
2025-05-07T20:32:23.8021657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:23.8022346Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:23.8022769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:23.8023011Z
2025-05-07T20:32:23.8023235Z self =
2025-05-07T20:32:23.8024626Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:23.8026061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7677eaa3b0>}
2025-05-07T20:32:23.8027453Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:23.8028509Z context =
2025-05-07T20:32:23.8028808Z
2025-05-07T20:32:23.8028992Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:23.8029701Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:23.8030197Z             module_map=module_map)
2025-05-07T20:32:23.8030583Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:23.8030955Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:23.8031245Z E   ^
2025-05-07T20:32:23.8031737Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.8032204Z
2025-05-07T20:32:23.8032644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:23.8033176Z
2025-05-07T20:32:23.8033286Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
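The example above (T=1, D=7168, scale_ub=None) is the only one in this stretch that makes it past fn(): the failure moves into the reference path instead, because triton_quantize_fp8_row JIT-compiles its own FP8 kernel (_kernel_quantize_fp8_row), here under the autotuner, and hits the identical architecture check. Any hardware guard therefore has to cover the reference as well. If a Triton-free reference were wanted, row-wise FP8 quantization can be expressed in plain PyTorch; the sketch below is hypothetical, with scale semantics inferred from the test's dequant step (y = y_fp8.to(torch.float32) * y_scale[:, None]) rather than from FBGEMM's implementation:

    import torch

    FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Hypothetical Triton-free stand-in for triton_quantize_fp8_row:
        # one dequant scale per row, chosen so each row's max maps to FP8 max.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale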
2025-05-07T20:32:23.9762788Z op = torch.compile(op) 2025-05-07T20:32:23.9763103Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9763392Z 2025-05-07T20:32:23.9763600Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.9763780Z 2025-05-07T20:32:23.9763885Z moe/activation_test.py:117: 2025-05-07T20:32:23.9764343Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9764692Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.9764991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9765585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.9766174Z return fn(*args, **kwargs) 2025-05-07T20:32:23.9766869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.9767594Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.9768158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.9768868Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.9769565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.9770128Z kernel = self.compile( 2025-05-07T20:32:23.9770693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.9771384Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.9771887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9772126Z 2025-05-07T20:32:23.9772344Z self = 2025-05-07T20:32:23.9773471Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.9774900Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d9c60>} 2025-05-07T20:32:23.9776307Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.9777432Z context = 2025-05-07T20:32:23.9777733Z 2025-05-07T20:32:23.9777917Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.9778460Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.9778962Z module_map=module_map) 2025-05-07T20:32:23.9779358Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.9779727Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.9780004Z E ^ 2025-05-07T20:32:23.9780492Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.9780964Z 2025-05-07T20:32:23.9781407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.9781942Z 2025-05-07T20:32:23.9782054Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.9782496Z self=, 2025-05-07T20:32:23.9782918Z T=1, 2025-05-07T20:32:23.9783111Z D=5120, 2025-05-07T20:32:23.9783322Z scale_ub=1200.0, 2025-05-07T20:32:23.9783561Z contiguous=False, 2025-05-07T20:32:23.9783800Z compiled=False, 2025-05-07T20:32:23.9784017Z ) 2025-05-07T20:32:23.9784358Z self = 2025-05-07T20:32:23.9784880Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:23.9785160Z 2025-05-07T20:32:23.9785242Z @given( 2025-05-07T20:32:23.9785511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.9785928Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.9786258Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.9786608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.9786972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.9787316Z ) 2025-05-07T20:32:23.9787683Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.9788149Z def test_silu_mul_quant( 2025-05-07T20:32:23.9788408Z self, 2025-05-07T20:32:23.9788617Z T: int, 2025-05-07T20:32:23.9788824Z D: int, 2025-05-07T20:32:23.9789059Z scale_ub: Optional[float], 2025-05-07T20:32:23.9789348Z contiguous: bool, 2025-05-07T20:32:23.9789600Z compiled: bool, 2025-05-07T20:32:23.9789839Z ) -> None: 2025-05-07T20:32:23.9790069Z torch.manual_seed(2025) 2025-05-07T20:32:23.9790320Z 2025-05-07T20:32:23.9790607Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.9790972Z 2025-05-07T20:32:23.9791173Z x_sign = torch.sign(x) 2025-05-07T20:32:23.9791487Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.9791815Z x = x_sign * x_clamp 2025-05-07T20:32:23.9792149Z x0 = x[:, :D] 2025-05-07T20:32:23.9792386Z x1 = x[:, D:] 2025-05-07T20:32:23.9792609Z 2025-05-07T20:32:23.9792803Z if contiguous: 2025-05-07T20:32:23.9793048Z x0 = x0.contiguous() 2025-05-07T20:32:23.9793322Z x1 = x1.contiguous() 2025-05-07T20:32:23.9793625Z 2025-05-07T20:32:23.9793832Z if scale_ub is not None: 2025-05-07T20:32:23.9794124Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.9794480Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.9794803Z ) 2025-05-07T20:32:23.9795008Z else: 2025-05-07T20:32:23.9795233Z scale_ub_tensor = None 2025-05-07T20:32:23.9795497Z 2025-05-07T20:32:23.9795749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.9796080Z op = silu_mul_quant 2025-05-07T20:32:23.9796344Z if compiled: 2025-05-07T20:32:23.9796611Z op = torch.compile(op) 2025-05-07T20:32:23.9796937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9797226Z 2025-05-07T20:32:23.9797434Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.9797609Z 2025-05-07T20:32:23.9797719Z moe/activation_test.py:117: 2025-05-07T20:32:23.9798029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9798378Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.9798680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9799405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.9800126Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.9800695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.9801413Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.9802108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.9802671Z kernel = self.compile( 2025-05-07T20:32:23.9803243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.9803932Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.9804346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9804589Z 2025-05-07T20:32:23.9804804Z self = 2025-05-07T20:32:23.9806020Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.9807462Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d9750>} 2025-05-07T20:32:23.9808863Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.9809922Z context = 2025-05-07T20:32:23.9810227Z 2025-05-07T20:32:23.9810404Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.9810950Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.9811446Z module_map=module_map) 2025-05-07T20:32:23.9811831Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.9812203Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.9812478Z E ^ 2025-05-07T20:32:23.9812960Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.9813544Z 2025-05-07T20:32:23.9813979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.9814519Z 2025-05-07T20:32:23.9814632Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.9815061Z self=, 2025-05-07T20:32:23.9815483Z T=16384, 2025-05-07T20:32:23.9815686Z D=5120, 2025-05-07T20:32:23.9815889Z scale_ub=1200.0, 2025-05-07T20:32:23.9816121Z contiguous=False, 2025-05-07T20:32:23.9816369Z compiled=True, 2025-05-07T20:32:23.9816589Z ) 2025-05-07T20:32:24.0839471Z self = 2025-05-07T20:32:24.0840248Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:24.0840692Z 2025-05-07T20:32:24.0840813Z @given( 2025-05-07T20:32:24.0841200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.0841650Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.0842106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.0842462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.0842806Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.0843116Z ) 2025-05-07T20:32:24.0843487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.0843956Z def test_silu_mul_quant( 2025-05-07T20:32:24.0844209Z self, 2025-05-07T20:32:24.0844421Z T: int, 2025-05-07T20:32:24.0844635Z D: int, 2025-05-07T20:32:24.0844864Z scale_ub: Optional[float], 2025-05-07T20:32:24.0845157Z contiguous: bool, 2025-05-07T20:32:24.0845419Z compiled: bool, 2025-05-07T20:32:24.0845654Z ) -> None: 2025-05-07T20:32:24.0845887Z torch.manual_seed(2025) 2025-05-07T20:32:24.0846156Z 2025-05-07T20:32:24.0846439Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.0846801Z 2025-05-07T20:32:24.0847011Z x_sign = torch.sign(x) 2025-05-07T20:32:24.0847315Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.0847649Z x = x_sign * x_clamp 2025-05-07T20:32:24.0847907Z x0 = x[:, :D] 2025-05-07T20:32:24.0848133Z x1 = x[:, D:] 2025-05-07T20:32:24.0848357Z 2025-05-07T20:32:24.0848557Z if contiguous: 2025-05-07T20:32:24.0848803Z x0 = x0.contiguous() 2025-05-07T20:32:24.0849078Z x1 = x1.contiguous() 2025-05-07T20:32:24.0849340Z 2025-05-07T20:32:24.0849542Z if scale_ub is not None: 2025-05-07T20:32:24.0850000Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.0850364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.0850692Z ) 2025-05-07T20:32:24.0850895Z else: 2025-05-07T20:32:24.0851126Z scale_ub_tensor = None 2025-05-07T20:32:24.0851422Z 2025-05-07T20:32:24.0851672Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.0852014Z op = silu_mul_quant 2025-05-07T20:32:24.0852278Z if compiled: 2025-05-07T20:32:24.0852547Z op = torch.compile(op) 2025-05-07T20:32:24.0852863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.0853156Z 2025-05-07T20:32:24.0853365Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.0853540Z 2025-05-07T20:32:24.0853657Z moe/activation_test.py:117: 2025-05-07T20:32:24.0853970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.0854323Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.0854638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.0855222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:24.0855811Z return fn(*args, **kwargs) 
2025-05-07T20:32:24.0856633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.0857358Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.0857919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.0858637Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.0859336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.0859898Z kernel = self.compile( 2025-05-07T20:32:24.0860466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.0861152Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.0861573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.0861820Z 2025-05-07T20:32:24.0862035Z self = 2025-05-07T20:32:24.0863164Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.0864597Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d96c0>} 2025-05-07T20:32:24.0866003Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.0867075Z context = 2025-05-07T20:32:24.0867375Z 2025-05-07T20:32:24.0867560Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.0868116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.0868609Z module_map=module_map) 2025-05-07T20:32:24.0868998Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.0869367Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.0869648Z E ^ 2025-05-07T20:32:24.0870140Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.0870609Z 2025-05-07T20:32:24.0871546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.0872093Z 2025-05-07T20:32:24.0872206Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.0872644Z self=, 2025-05-07T20:32:24.0873073Z T=2048, 2025-05-07T20:32:24.0873279Z D=7168, 2025-05-07T20:32:24.0873489Z scale_ub=1200.0, 2025-05-07T20:32:24.0873835Z contiguous=False, 2025-05-07T20:32:24.0874072Z compiled=True, 2025-05-07T20:32:24.0874292Z ) 2025-05-07T20:32:24.0874631Z self = 2025-05-07T20:32:24.0875147Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:24.0875441Z 2025-05-07T20:32:24.0875527Z @given( 2025-05-07T20:32:24.0875779Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.0876109Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.0876437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.0876805Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.0877162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.0877463Z ) 2025-05-07T20:32:24.0877841Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.0878401Z def test_silu_mul_quant( 2025-05-07T20:32:24.0878660Z self, 2025-05-07T20:32:24.0878870Z T: int, 2025-05-07T20:32:24.0879090Z D: int, 2025-05-07T20:32:24.0879325Z scale_ub: Optional[float], 2025-05-07T20:32:24.0879618Z contiguous: bool, 2025-05-07T20:32:24.0879879Z compiled: bool, 2025-05-07T20:32:24.0880115Z ) -> None: 2025-05-07T20:32:24.0880350Z torch.manual_seed(2025) 2025-05-07T20:32:24.0880609Z 2025-05-07T20:32:24.0880897Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.0881263Z 2025-05-07T20:32:24.0881469Z x_sign = torch.sign(x) 2025-05-07T20:32:24.0881785Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.0882111Z x = x_sign * x_clamp 2025-05-07T20:32:24.0882369Z x0 = x[:, :D] 2025-05-07T20:32:24.0882602Z x1 = x[:, D:] 2025-05-07T20:32:24.0882829Z 2025-05-07T20:32:24.0883029Z if contiguous: 2025-05-07T20:32:24.0883281Z x0 = x0.contiguous() 2025-05-07T20:32:24.0883552Z x1 = x1.contiguous() 2025-05-07T20:32:24.0883811Z 2025-05-07T20:32:24.0884018Z if scale_ub is not None: 2025-05-07T20:32:24.0884305Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.0884665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.0884995Z ) 2025-05-07T20:32:24.0885201Z else: 2025-05-07T20:32:24.0885434Z scale_ub_tensor = None 2025-05-07T20:32:24.0885708Z 2025-05-07T20:32:24.0885950Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.0886284Z op = silu_mul_quant 2025-05-07T20:32:24.0886557Z if compiled: 2025-05-07T20:32:24.0886818Z op = torch.compile(op) 2025-05-07T20:32:24.0887137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.0887432Z 2025-05-07T20:32:24.0887648Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.0887823Z 2025-05-07T20:32:24.0887929Z moe/activation_test.py:117: 2025-05-07T20:32:24.0888247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.0888598Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.0888898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.0889485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:24.0890077Z return fn(*args, **kwargs) 
2025-05-07T20:32:24.0890773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.0891582Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.0892152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.0892872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.0893567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.0894132Z kernel = self.compile( 2025-05-07T20:32:24.0894705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.0895397Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.0895815Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.0896063Z 2025-05-07T20:32:24.0896281Z self = 2025-05-07T20:32:24.0897466Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.0898895Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d9d80>} 2025-05-07T20:32:24.0900367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.0901439Z context = 2025-05-07T20:32:24.0901745Z 2025-05-07T20:32:24.0901922Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.0902473Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.0902969Z module_map=module_map) 2025-05-07T20:32:24.0903355Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.0903729Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.0904014Z E ^ 2025-05-07T20:32:24.0904499Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis then tries the remaining examples. Each one re-prints the same test body and fails with the identical traceback and CompilationError shown above, so only the sampled parameters are listed here:

Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
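Note that the compiled=True and compiled=False examples fail the same way. torch.compile only contributes the extra torch/_dynamo/eval_frame.py frame visible in the compiled tracebacks; the Triton kernel inside silu_mul_quant is JIT-compiled at first call on either path, and that is where the error originates. A sketch of the distinction, with the import path inferred from the traceback above:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    eager_op = silu_mul_quant
    compiled_op = torch.compile(silu_mul_quant)
    # Dynamo wraps the Python-level call, but _fbgemm_silu_mul_quant[grid](...)
    # is still compiled by Triton when either op first runs, so both raise the
    # same CompilationError on a pre-SM-8.9 GPU.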
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.9425214Z 2025-05-07T20:32:24.9425661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.9426208Z 2025-05-07T20:32:24.9426327Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9426770Z self=, 2025-05-07T20:32:24.9427202Z T=1, 2025-05-07T20:32:24.9427435Z D=7168, 2025-05-07T20:32:24.9427658Z scale_ub=None, 2025-05-07T20:32:24.9427891Z contiguous=True, 2025-05-07T20:32:24.9428138Z compiled=False, 2025-05-07T20:32:24.9428361Z ) 2025-05-07T20:32:24.9428854Z self = 2025-05-07T20:32:24.9429392Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.9429675Z 2025-05-07T20:32:24.9429775Z @given( 2025-05-07T20:32:24.9430031Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9430376Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9430714Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9431073Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9431435Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9431752Z ) 2025-05-07T20:32:24.9432132Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9432614Z def test_silu_mul_quant( 2025-05-07T20:32:24.9432884Z self, 2025-05-07T20:32:24.9433095Z T: int, 2025-05-07T20:32:24.9433318Z D: int, 2025-05-07T20:32:24.9433639Z scale_ub: Optional[float], 2025-05-07T20:32:24.9433936Z contiguous: bool, 2025-05-07T20:32:24.9434203Z compiled: bool, 2025-05-07T20:32:24.9434453Z ) -> None: 2025-05-07T20:32:24.9434693Z torch.manual_seed(2025) 2025-05-07T20:32:24.9435090Z 2025-05-07T20:32:24.9435392Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9435767Z 2025-05-07T20:32:24.9435977Z x_sign = torch.sign(x) 2025-05-07T20:32:24.9436297Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.9436639Z x = x_sign * x_clamp 2025-05-07T20:32:24.9436898Z x0 = x[:, :D] 2025-05-07T20:32:24.9437140Z x1 = x[:, D:] 2025-05-07T20:32:24.9437402Z 2025-05-07T20:32:24.9437619Z if contiguous: 2025-05-07T20:32:24.9437876Z x0 = x0.contiguous() 2025-05-07T20:32:24.9438163Z x1 = x1.contiguous() 2025-05-07T20:32:24.9438425Z 2025-05-07T20:32:24.9438641Z if scale_ub is not None: 2025-05-07T20:32:24.9438950Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.9439314Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.9439654Z ) 2025-05-07T20:32:24.9439875Z else: 2025-05-07T20:32:24.9440110Z scale_ub_tensor = None 2025-05-07T20:32:24.9440389Z 2025-05-07T20:32:24.9440644Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.9440994Z op = silu_mul_quant 2025-05-07T20:32:24.9441267Z if compiled: 2025-05-07T20:32:24.9441543Z op = torch.compile(op) 2025-05-07T20:32:24.9441871Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.9442170Z 2025-05-07T20:32:24.9442386Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.9442567Z 2025-05-07T20:32:24.9442684Z moe/activation_test.py:117: 2025-05-07T20:32:24.9443006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.9443373Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.9443685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.9444437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.9445186Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.9445770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.9446517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.9447231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.9447811Z kernel = self.compile( 2025-05-07T20:32:24.9448400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.9449114Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.9449629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.9449884Z 2025-05-07T20:32:24.9450106Z self = 2025-05-07T20:32:24.9451275Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.9452751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76769884c0>} 2025-05-07T20:32:24.9454192Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.9455297Z context = 2025-05-07T20:32:24.9455618Z 2025-05-07T20:32:24.9455804Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.9456373Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.9456963Z module_map=module_map) 2025-05-07T20:32:24.9457365Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.9457751Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.9458041Z E ^ 2025-05-07T20:32:24.9458544Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.9459035Z 2025-05-07T20:32:24.9459484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.9460037Z 2025-05-07T20:32:24.9460158Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9460612Z self=, 2025-05-07T20:32:24.9461051Z T=16384, 2025-05-07T20:32:24.9461266Z D=7168, 2025-05-07T20:32:24.9461481Z scale_ub=1200.0, 2025-05-07T20:32:24.9461728Z contiguous=False, 2025-05-07T20:32:24.9461987Z compiled=True, 2025-05-07T20:32:25.2226671Z ) 2025-05-07T20:32:25.2227320Z self = 2025-05-07T20:32:25.2228149Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.2228563Z 2025-05-07T20:32:25.2228683Z @given( 2025-05-07T20:32:25.2229041Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2229502Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2229981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2230336Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2230681Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2230991Z ) 2025-05-07T20:32:25.2231375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2231838Z def test_silu_mul_quant( 2025-05-07T20:32:25.2232098Z self, 2025-05-07T20:32:25.2232314Z T: int, 2025-05-07T20:32:25.2232522Z D: int, 2025-05-07T20:32:25.2232753Z scale_ub: Optional[float], 2025-05-07T20:32:25.2233043Z contiguous: bool, 2025-05-07T20:32:25.2233301Z compiled: bool, 2025-05-07T20:32:25.2233629Z ) -> None: 2025-05-07T20:32:25.2233862Z torch.manual_seed(2025) 2025-05-07T20:32:25.2234120Z 2025-05-07T20:32:25.2234406Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2234765Z 2025-05-07T20:32:25.2234970Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2235273Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2235601Z x = x_sign * x_clamp 2025-05-07T20:32:25.2235863Z x0 = x[:, :D] 2025-05-07T20:32:25.2236277Z x1 = x[:, D:] 2025-05-07T20:32:25.2236503Z 2025-05-07T20:32:25.2236701Z if contiguous: 2025-05-07T20:32:25.2236943Z x0 = x0.contiguous() 2025-05-07T20:32:25.2237218Z x1 = x1.contiguous() 2025-05-07T20:32:25.2237478Z 2025-05-07T20:32:25.2237680Z if scale_ub is not None: 2025-05-07T20:32:25.2237971Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2238324Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2238651Z ) 2025-05-07T20:32:25.2238853Z else: 2025-05-07T20:32:25.2239078Z scale_ub_tensor = None 2025-05-07T20:32:25.2239345Z 2025-05-07T20:32:25.2239585Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2239921Z op = silu_mul_quant 2025-05-07T20:32:25.2240191Z if compiled: 2025-05-07T20:32:25.2240449Z op = torch.compile(op) 2025-05-07T20:32:25.2240763Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2241066Z 2025-05-07T20:32:25.2241265Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2241446Z 2025-05-07T20:32:25.2241552Z moe/activation_test.py:117: 2025-05-07T20:32:25.2241868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2242339Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2242641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2243234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2243827Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2244518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2245244Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2245814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2246537Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2247237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2247807Z kernel = self.compile( 2025-05-07T20:32:25.2248382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2249069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2249493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2249737Z 2025-05-07T20:32:25.2249954Z self = 2025-05-07T20:32:25.2251090Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2252529Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76769895a0>} 2025-05-07T20:32:25.2253939Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2255017Z context = 2025-05-07T20:32:25.2255317Z 2025-05-07T20:32:25.2255500Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2256049Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2256543Z module_map=module_map) 2025-05-07T20:32:25.2256926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2257401Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2257701Z E ^ 2025-05-07T20:32:25.2258187Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2258664Z 2025-05-07T20:32:25.2259105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2259641Z 2025-05-07T20:32:25.2259759Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2260188Z self=, 2025-05-07T20:32:25.2260615Z T=1, 2025-05-07T20:32:25.2260819Z D=7168, 2025-05-07T20:32:25.2261020Z scale_ub=None, 2025-05-07T20:32:25.2261254Z contiguous=False, 2025-05-07T20:32:25.2261495Z compiled=False, 2025-05-07T20:32:25.2261709Z ) 2025-05-07T20:32:25.2262049Z self = 2025-05-07T20:32:25.2262579Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.2262855Z 2025-05-07T20:32:25.2262940Z @given( 2025-05-07T20:32:25.2263188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2263520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2263931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2264274Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2264622Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2264924Z ) 2025-05-07T20:32:25.2265291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2265760Z def test_silu_mul_quant( 2025-05-07T20:32:25.2266018Z self, 2025-05-07T20:32:25.2266219Z T: int, 2025-05-07T20:32:25.2266431Z D: int, 2025-05-07T20:32:25.2266663Z scale_ub: Optional[float], 2025-05-07T20:32:25.2266952Z contiguous: bool, 2025-05-07T20:32:25.2267207Z compiled: bool, 2025-05-07T20:32:25.2267478Z ) -> None: 2025-05-07T20:32:25.2267726Z torch.manual_seed(2025) 2025-05-07T20:32:25.2274990Z 2025-05-07T20:32:25.2275314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2275697Z 2025-05-07T20:32:25.2275901Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2276213Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2276544Z x = x_sign * x_clamp 2025-05-07T20:32:25.2276795Z x0 = x[:, :D] 2025-05-07T20:32:25.2277023Z x1 = x[:, D:] 2025-05-07T20:32:25.2277248Z 2025-05-07T20:32:25.2277444Z if contiguous: 2025-05-07T20:32:25.2277692Z x0 = x0.contiguous() 2025-05-07T20:32:25.2277965Z x1 = x1.contiguous() 2025-05-07T20:32:25.2278217Z 2025-05-07T20:32:25.2278414Z if scale_ub is not None: 2025-05-07T20:32:25.2278701Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2279061Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2279395Z ) 2025-05-07T20:32:25.2279601Z else: 2025-05-07T20:32:25.2279827Z scale_ub_tensor = None 2025-05-07T20:32:25.2280090Z 2025-05-07T20:32:25.2280344Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2280677Z op = silu_mul_quant 2025-05-07T20:32:25.2280938Z if compiled: 2025-05-07T20:32:25.2281204Z op = torch.compile(op) 2025-05-07T20:32:25.2281517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2281805Z 2025-05-07T20:32:25.2282015Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2282189Z 2025-05-07T20:32:25.2282303Z moe/activation_test.py:117: 2025-05-07T20:32:25.2282614Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2282969Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2283269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2284112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2284836Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f7676989d80>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f767698af80>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
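Every failure in this run is the same compile-time rejection: the Triton CUDA backend refuses the fp8e4nv element type (the e4m3 format behind torch.float8_e4m3fn), and the error's list of supported types, fp8e4b15 and fp8e5 only, indicates a GPU with compute capability below 8.9 (Ada/Hopper), where Triton first accepts fp8e4nv. Under that assumption, a guard of the following shape would let the suite skip instead of fail on such machines. This is a hypothetical sketch; supports_fp8e4nv and requires_fp8e4nv are illustrative names, not part of the test file above.

import pytest
import torch

# Hypothetical guard, not taken from the FBGEMM sources. Triton's fp8e4nv
# maps to torch.float8_e4m3fn and is assumed to compile only on GPUs with
# compute capability >= (8, 9); older parts raise the CompilationError
# seen in this log.
def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Decorating fp8 tests with this marker turns FAILED into SKIPPED on
# unsupported runners.
requires_fp8e4nv = pytest.mark.skipif(
    not supports_fp8e4nv(), reason="GPU lacks fp8e4nv support (needs SM 8.9+)"
)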
[Hypothesis keeps drawing examples. Each of the following draws fails in _fbgemm_silu_mul_quant with a traceback and CompilationError identical to the one above, so only the drawn parameters are kept:]

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
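The rejection is independent of the FBGEMM kernel; any Triton program that produces an fp8e4nv value should trip the same check at compile time. A minimal repro sketch, assuming a Triton build that exposes tl.float8e4nv and a PyTorch build with torch.float8_e4m3fn (the kernel is illustrative, not the FBGEMM one):

import torch
import triton
import triton.language as tl

@triton.jit
def _fp8_cast_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    # Load a block of fp32 values and store them as fp8e4nv. On a GPU whose
    # backend lacks fp8e4nv, compilation aborts with the ValueError repeated
    # throughout this log.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# Expected to raise triton.compiler.errors.CompilationError on pre-SM-8.9 GPUs.
_fp8_cast_kernel[(1,)](x, y, 1024, BLOCK=1024)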
[Three further draws fail identically:]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
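The operator under test is visible here only through its call site, op(x0, x1, scale_ub_tensor), which returns a (y_fp8, y_scale) pair. Assuming the SwiGLU-style fused activation and rowwise fp8 quantization that the name suggests, an eager-mode reference might look like the sketch below. This is a plausible reading for orientation, not the FBGEMM implementation:

from typing import Optional, Tuple

import torch

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Compute silu(x0) * x1 in fp32 for accuracy, then quantize each row
    # into the fp8 e4m3 range with a per-row scale.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scale = row_max / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)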
[Two more draws fail identically:]

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
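Since verbose Hypothesis output like this can bury the signal, a common follow-up once a failing draw is known is to pin it with @example so it always runs, independent of the random search. A self-contained sketch of the pattern over the same parameter space (test_params_smoke and its assertion are stand-ins; a real regression test would build the tensors and call silu_mul_quant as above):

from typing import Optional

from hypothesis import example, given, settings
from hypothesis import strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
@settings(max_examples=10, deadline=None)
def test_params_smoke(
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    # Stand-in check; the real test would exercise the kernel here.
    assert T >= 1 and D in (5120, 7168)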
[The remaining draws in this excerpt fail the same way:]

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)

[The last draw's traceback, identical to the ones above, ends in:]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.7774320Z 2025-05-07T20:32:26.7774793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:26.7775368Z 2025-05-07T20:32:26.9906810Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:26.9908127Z self=, 2025-05-07T20:32:26.9908904Z T=4096, 2025-05-07T20:32:26.9909232Z D=7168, 2025-05-07T20:32:26.9909537Z scale_ub=1200.0, 2025-05-07T20:32:26.9909812Z contiguous=False, 2025-05-07T20:32:26.9910082Z compiled=True, 2025-05-07T20:32:26.9910321Z ) 2025-05-07T20:32:26.9910711Z self = 2025-05-07T20:32:26.9911293Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:26.9911836Z 2025-05-07T20:32:26.9911930Z @given( 2025-05-07T20:32:26.9912205Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:26.9912574Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:26.9912929Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:26.9913318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:26.9913777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:26.9914113Z ) 2025-05-07T20:32:26.9914519Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:26.9915039Z def test_silu_mul_quant( 2025-05-07T20:32:26.9915335Z self, 2025-05-07T20:32:26.9915562Z T: int, 2025-05-07T20:32:26.9915799Z D: int, 2025-05-07T20:32:26.9916066Z scale_ub: Optional[float], 2025-05-07T20:32:26.9916381Z contiguous: bool, 2025-05-07T20:32:26.9916669Z compiled: bool, 2025-05-07T20:32:26.9916936Z ) -> None: 2025-05-07T20:32:26.9917192Z torch.manual_seed(2025) 2025-05-07T20:32:26.9917480Z 2025-05-07T20:32:26.9917807Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:26.9918203Z 2025-05-07T20:32:26.9918437Z x_sign = torch.sign(x) 2025-05-07T20:32:26.9918778Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:26.9919145Z x = x_sign * x_clamp 2025-05-07T20:32:26.9919421Z x0 = x[:, :D] 2025-05-07T20:32:26.9919678Z x1 = x[:, D:] 2025-05-07T20:32:26.9919927Z 2025-05-07T20:32:26.9920149Z if contiguous: 2025-05-07T20:32:26.9920423Z x0 = x0.contiguous() 2025-05-07T20:32:26.9920730Z x1 = x1.contiguous() 2025-05-07T20:32:26.9921006Z 2025-05-07T20:32:26.9921233Z if scale_ub is not None: 2025-05-07T20:32:26.9921553Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:26.9921938Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:26.9922295Z ) 2025-05-07T20:32:26.9922534Z else: 2025-05-07T20:32:26.9922776Z scale_ub_tensor = None 2025-05-07T20:32:26.9923071Z 2025-05-07T20:32:26.9923346Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.9923705Z op = silu_mul_quant 2025-05-07T20:32:26.9924371Z if compiled: 2025-05-07T20:32:26.9924666Z op = torch.compile(op) 2025-05-07T20:32:26.9925009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:26.9925335Z 2025-05-07T20:32:26.9925563Z > y_fp8, y_scale = fn() 2025-05-07T20:32:26.9925755Z 2025-05-07T20:32:26.9925879Z moe/activation_test.py:117: 2025-05-07T20:32:26.9926222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.9933283Z moe/activation_test.py:115: in fn 2025-05-07T20:32:26.9933640Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:26.9934295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:26.9934950Z return fn(*args, **kwargs) 
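Note: every CompilationError in this run has the same root cause. The Triton kernel behind silu_mul_quant stores its output as fp8e4nv (torch.float8_e4m3fn), and the g5 runner's NVIDIA A10G is compute capability 8.6, while Triton lowers fp8e4nv natively only on sm_89 and newer (Ada/Hopper); older parts expose only ('fp8e4b15', 'fp8e5'), exactly as the ValueError reports. A minimal sketch of a capability guard follows; the (8, 9) threshold and the decorator name are illustrative assumptions, not part of the FBGEMM test suite.

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv corresponds to torch.float8_e4m3fn; Triton emits it natively
    # only on compute capability >= 8.9 (assumed threshold). The A10G here
    # is sm_86, hence the repeated CompilationError above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical guard the FP8 quantization tests could carry:
skip_unless_fp8e4nv = unittest.skipUnless(
    gpu_supports_fp8e4nv(), "Triton fp8e4nv is unsupported on this GPU"
)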
2025-05-07T20:32:26.9935717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:26.9936511Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:26.9937124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:26.9937905Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:26.9938667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:26.9939273Z kernel = self.compile( 2025-05-07T20:32:26.9939916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:26.9940671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:26.9941264Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.9941533Z 2025-05-07T20:32:26.9941775Z self = 2025-05-07T20:32:26.9943347Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:26.9944915Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76762d4550>} 2025-05-07T20:32:26.9946444Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:26.9947604Z context = 2025-05-07T20:32:26.9947943Z 2025-05-07T20:32:26.9948137Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:26.9948732Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:26.9949271Z module_map=module_map) 2025-05-07T20:32:26.9949686Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:26.9950091Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:26.9950393Z E ^ 2025-05-07T20:32:26.9950925Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.9951444Z 2025-05-07T20:32:26.9951921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:26.9952512Z 2025-05-07T20:32:26.9952632Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:26.9953111Z self=, 2025-05-07T20:32:26.9953665Z T=128, 2025-05-07T20:32:26.9953883Z D=7168, 2025-05-07T20:32:26.9954113Z scale_ub=1200.0, 2025-05-07T20:32:26.9954366Z contiguous=False, 2025-05-07T20:32:26.9954633Z compiled=True, 2025-05-07T20:32:26.9954868Z ) 2025-05-07T20:32:27.1088563Z self = 2025-05-07T20:32:27.1089399Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:27.1089834Z 2025-05-07T20:32:27.1089963Z @given( 2025-05-07T20:32:27.1090338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.1090852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.1091541Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.1091986Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.1092365Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.1092698Z ) 2025-05-07T20:32:27.1093096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.1093597Z def test_silu_mul_quant( 2025-05-07T20:32:27.1093876Z self, 2025-05-07T20:32:27.1094098Z T: int, 2025-05-07T20:32:27.1094324Z D: int, 2025-05-07T20:32:27.1094575Z scale_ub: Optional[float], 2025-05-07T20:32:27.1094879Z contiguous: bool, 2025-05-07T20:32:27.1095159Z compiled: bool, 2025-05-07T20:32:27.1095422Z ) -> None: 2025-05-07T20:32:27.1095663Z torch.manual_seed(2025) 2025-05-07T20:32:27.1095940Z 2025-05-07T20:32:27.1096255Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.1096640Z 2025-05-07T20:32:27.1096867Z x_sign = torch.sign(x) 2025-05-07T20:32:27.1097200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.1097555Z x = x_sign * x_clamp 2025-05-07T20:32:27.1097842Z x0 = x[:, :D] 2025-05-07T20:32:27.1098274Z x1 = x[:, D:] 2025-05-07T20:32:27.1098514Z 2025-05-07T20:32:27.1098723Z if contiguous: 2025-05-07T20:32:27.1098991Z x0 = x0.contiguous() 2025-05-07T20:32:27.1099287Z x1 = x1.contiguous() 2025-05-07T20:32:27.1099557Z 2025-05-07T20:32:27.1099778Z if scale_ub is not None: 2025-05-07T20:32:27.1100090Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.1100469Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.1100820Z ) 2025-05-07T20:32:27.1101040Z else: 2025-05-07T20:32:27.1101275Z scale_ub_tensor = None 2025-05-07T20:32:27.1101562Z 2025-05-07T20:32:27.1101831Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.1102186Z op = silu_mul_quant 2025-05-07T20:32:27.1102475Z if compiled: 2025-05-07T20:32:27.1102760Z op = torch.compile(op) 2025-05-07T20:32:27.1103101Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.1103418Z 2025-05-07T20:32:27.1103641Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.1103830Z 2025-05-07T20:32:27.1103950Z moe/activation_test.py:117: 2025-05-07T20:32:27.1104284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.1104664Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.1104992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.1105627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:27.1106266Z return fn(*args, **kwargs) 
2025-05-07T20:32:27.1107019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.1107805Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.1108409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.1109193Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.1109947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.1110548Z kernel = self.compile( 2025-05-07T20:32:27.1111165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.1111912Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.1112359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.1112620Z 2025-05-07T20:32:27.1112852Z self = 2025-05-07T20:32:27.1114267Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.1115856Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76762d4f70>} 2025-05-07T20:32:27.1117388Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.1118556Z context = 2025-05-07T20:32:27.1118887Z 2025-05-07T20:32:27.1119080Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.1119682Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.1120221Z module_map=module_map) 2025-05-07T20:32:27.1120636Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.1121039Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.1121462Z E ^ 2025-05-07T20:32:27.1121997Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.1122512Z 2025-05-07T20:32:27.1122990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.1123577Z 2025-05-07T20:32:27.1123699Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.1124557Z self=, 2025-05-07T20:32:27.1125018Z T=2048, 2025-05-07T20:32:27.1125230Z D=7168, 2025-05-07T20:32:27.1125453Z scale_ub=None, 2025-05-07T20:32:27.1125706Z contiguous=True, 2025-05-07T20:32:27.1125976Z compiled=True, 2025-05-07T20:32:27.1126204Z ) 2025-05-07T20:32:27.1126571Z self = 2025-05-07T20:32:27.1127131Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:27.1127441Z 2025-05-07T20:32:27.1127530Z @given( 2025-05-07T20:32:27.1127801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.1128159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.1128509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.1128887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.1129266Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.1129591Z ) 2025-05-07T20:32:27.1129995Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.1130502Z def test_silu_mul_quant( 2025-05-07T20:32:27.1130782Z self, 2025-05-07T20:32:27.1131005Z T: int, 2025-05-07T20:32:27.1131238Z D: int, 2025-05-07T20:32:27.1131494Z scale_ub: Optional[float], 2025-05-07T20:32:27.1131802Z contiguous: bool, 2025-05-07T20:32:27.1132081Z compiled: bool, 2025-05-07T20:32:27.1132345Z ) -> None: 2025-05-07T20:32:27.1132589Z torch.manual_seed(2025) 2025-05-07T20:32:27.1132866Z 2025-05-07T20:32:27.1133180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.1133565Z 2025-05-07T20:32:27.1133790Z x_sign = torch.sign(x) 2025-05-07T20:32:27.1134125Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.1134481Z x = x_sign * x_clamp 2025-05-07T20:32:27.1134764Z x0 = x[:, :D] 2025-05-07T20:32:27.1135015Z x1 = x[:, D:] 2025-05-07T20:32:27.1135252Z 2025-05-07T20:32:27.1135469Z if contiguous: 2025-05-07T20:32:27.1135739Z x0 = x0.contiguous() 2025-05-07T20:32:27.1136036Z x1 = x1.contiguous() 2025-05-07T20:32:27.1136461Z 2025-05-07T20:32:27.1136696Z if scale_ub is not None: 2025-05-07T20:32:27.1137012Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.1137392Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.1137749Z ) 2025-05-07T20:32:27.1137973Z else: 2025-05-07T20:32:27.1138212Z scale_ub_tensor = None 2025-05-07T20:32:27.1138501Z 2025-05-07T20:32:27.1138768Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.1139122Z op = silu_mul_quant 2025-05-07T20:32:27.1139414Z if compiled: 2025-05-07T20:32:27.1139702Z op = torch.compile(op) 2025-05-07T20:32:27.1140038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.1140355Z 2025-05-07T20:32:27.1140578Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.1140768Z 2025-05-07T20:32:27.1140884Z moe/activation_test.py:117: 2025-05-07T20:32:27.1141234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.1141615Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.1141939Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.1142570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:27.1143374Z return fn(*args, **kwargs) 
2025-05-07T20:32:27.1144125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.1144903Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.1145518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.1146299Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.1147056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.1147668Z kernel = self.compile( 2025-05-07T20:32:27.1148336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.1149092Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.1149556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.1149821Z 2025-05-07T20:32:27.1150058Z self = 2025-05-07T20:32:27.1151281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.1152836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76762d5bd0>} 2025-05-07T20:32:27.1154496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.1155797Z context = 2025-05-07T20:32:27.1156130Z 2025-05-07T20:32:27.1156323Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.1156922Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.1157458Z module_map=module_map) 2025-05-07T20:32:27.1157877Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.1158337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.1158642Z E ^ 2025-05-07T20:32:27.1159171Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.1159690Z 2025-05-07T20:32:27.1160267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.1160858Z 2025-05-07T20:32:27.1996562Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.1997125Z self=, 2025-05-07T20:32:27.1997799Z T=16384, 2025-05-07T20:32:27.1998100Z D=5120, 2025-05-07T20:32:27.1998380Z scale_ub=None, 2025-05-07T20:32:27.1998676Z contiguous=False, 2025-05-07T20:32:27.1998937Z compiled=False, 2025-05-07T20:32:27.1999177Z ) 2025-05-07T20:32:27.1999550Z self = 2025-05-07T20:32:27.2000124Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:27.2000453Z 2025-05-07T20:32:27.2000544Z @given( 2025-05-07T20:32:27.2000818Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.2001186Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.2001545Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.2001928Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.2002310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.2002856Z ) 2025-05-07T20:32:27.2003265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.2003778Z def test_silu_mul_quant( 2025-05-07T20:32:27.2004054Z self, 2025-05-07T20:32:27.2004287Z T: int, 2025-05-07T20:32:27.2004520Z D: int, 2025-05-07T20:32:27.2004772Z scale_ub: Optional[float], 2025-05-07T20:32:27.2005091Z contiguous: bool, 2025-05-07T20:32:27.2005371Z compiled: bool, 2025-05-07T20:32:27.2005630Z ) -> None: 2025-05-07T20:32:27.2005886Z torch.manual_seed(2025) 2025-05-07T20:32:27.2006167Z 2025-05-07T20:32:27.2006477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.2006879Z 2025-05-07T20:32:27.2007110Z x_sign = torch.sign(x) 2025-05-07T20:32:27.2007453Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.2009780Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:27.2011946Z 2025-05-07T20:32:27.2012087Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:27.2012359Z 2025-05-07T20:32:27.2012487Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.2012961Z self=, 2025-05-07T20:32:27.2013425Z T=4096, 2025-05-07T20:32:27.2013645Z D=7168, 2025-05-07T20:32:27.2013865Z scale_ub=1200.0, 2025-05-07T20:32:27.2014126Z contiguous=True, 2025-05-07T20:32:27.2014391Z compiled=True, 2025-05-07T20:32:27.2014621Z ) 2025-05-07T20:32:27.2014985Z self = 2025-05-07T20:32:27.2015547Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:27.2015854Z 2025-05-07T20:32:27.2015953Z @given( 2025-05-07T20:32:27.2016210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.2016571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.2016925Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.2017297Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.2017676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.2018141Z ) 2025-05-07T20:32:27.2018548Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.2019055Z def test_silu_mul_quant( 2025-05-07T20:32:27.2019334Z self, 2025-05-07T20:32:27.2019559Z T: int, 2025-05-07T20:32:27.2019791Z D: int, 2025-05-07T20:32:27.2020043Z scale_ub: Optional[float], 2025-05-07T20:32:27.2020358Z contiguous: bool, 2025-05-07T20:32:27.2020630Z compiled: bool, 2025-05-07T20:32:27.2020890Z ) -> None: 2025-05-07T20:32:27.2021142Z torch.manual_seed(2025) 2025-05-07T20:32:27.2021414Z 2025-05-07T20:32:27.2021725Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.2022117Z 2025-05-07T20:32:27.2022332Z x_sign = torch.sign(x) 2025-05-07T20:32:27.2022665Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.2025307Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:27.2027602Z 2025-05-07T20:32:27.2027747Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:27.2027991Z 2025-05-07T20:32:27.2028119Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.2028593Z self=, 2025-05-07T20:32:27.2029053Z T=16384, 2025-05-07T20:32:27.2029277Z D=7168, 2025-05-07T20:32:27.2029492Z scale_ub=None, 2025-05-07T20:32:27.2029744Z contiguous=False, 2025-05-07T20:32:27.2030007Z compiled=False, 2025-05-07T20:32:27.2030241Z ) 2025-05-07T20:32:27.2030607Z self = 2025-05-07T20:32:27.2031180Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:27.2031504Z 2025-05-07T20:32:27.2031595Z @given( 2025-05-07T20:32:27.2031858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.2032244Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.2032601Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.2032974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.2033356Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.2033784Z ) 2025-05-07T20:32:27.2034185Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.2034696Z def test_silu_mul_quant( 2025-05-07T20:32:27.2034976Z self, 2025-05-07T20:32:27.2035193Z T: int, 2025-05-07T20:32:27.2035427Z D: int, 2025-05-07T20:32:27.2035681Z scale_ub: Optional[float], 2025-05-07T20:32:27.2035987Z contiguous: bool, 2025-05-07T20:32:27.2036269Z compiled: bool, 2025-05-07T20:32:27.2036530Z ) -> None: 2025-05-07T20:32:27.2036781Z torch.manual_seed(2025) 2025-05-07T20:32:27.2037059Z 2025-05-07T20:32:27.2037378Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.2039971Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:27.2042141Z 2025-05-07T20:32:27.2042284Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:27.2042528Z 2025-05-07T20:32:27.2042649Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.2043136Z self=, 2025-05-07T20:32:27.2043600Z T=2048, 2025-05-07T20:32:27.2043813Z D=7168, 2025-05-07T20:32:27.2044038Z scale_ub=1200.0, 2025-05-07T20:32:27.2044299Z contiguous=True, 2025-05-07T20:32:27.2044549Z compiled=True, 2025-05-07T20:32:27.2044789Z ) 2025-05-07T20:32:27.2045154Z self = 2025-05-07T20:32:27.2045722Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:27.2046031Z 2025-05-07T20:32:27.2046120Z @given( 2025-05-07T20:32:27.2046385Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.2046743Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.2047099Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.2047480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.2047859Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.2048283Z ) 2025-05-07T20:32:27.2048688Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.2049202Z def test_silu_mul_quant( 2025-05-07T20:32:27.2049482Z self, 2025-05-07T20:32:27.2049701Z T: int, 2025-05-07T20:32:27.2049933Z D: int, 2025-05-07T20:32:27.2050186Z scale_ub: Optional[float], 2025-05-07T20:32:27.2050494Z contiguous: bool, 2025-05-07T20:32:27.2050772Z compiled: bool, 2025-05-07T20:32:27.2051031Z ) -> None: 2025-05-07T20:32:27.2051273Z torch.manual_seed(2025) 2025-05-07T20:32:27.2051552Z 2025-05-07T20:32:27.2051868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.2052259Z 2025-05-07T20:32:27.2052493Z x_sign = torch.sign(x) 2025-05-07T20:32:27.2052832Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.2055132Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
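Note: the OutOfMemoryError examples are a secondary failure mode. Each Hypothesis example allocates a fresh [T, 2*D] bfloat16 tensor (for T=16384, D=7168 that is 16384 x 14336 x 2 bytes = 448 MiB, matching the failed allocation above), and since the earlier failing examples leave their intermediates cached, the A10G's 22.07 GiB fills up until even 40-56 MiB requests fail. One mitigation, sketched below under the assumption that the tests live in a unittest.TestCase subclass (the class name here is hypothetical), is to release cached CUDA blocks between examples; the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True setting suggested by the error text itself would additionally need to be exported before the process starts.

import gc
import unittest

import torch

class ActivationTests(unittest.TestCase):  # hypothetical test-class name
    def tearDown(self) -> None:
        # Drop Python references left over from a failed example, then hand
        # the cached CUDA blocks back so the next example can allocate.
        gc.collect()
        torch.cuda.empty_cache()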
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:27.2057246Z 2025-05-07T20:32:27.2057387Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:27.2057631Z 2025-05-07T20:32:27.2057758Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.2058257Z self=, 2025-05-07T20:32:27.2058949Z T=2048, 2025-05-07T20:32:27.2059164Z D=7168, 2025-05-07T20:32:27.2059388Z scale_ub=None, 2025-05-07T20:32:27.2059635Z contiguous=True, 2025-05-07T20:32:27.2059895Z compiled=False, 2025-05-07T20:32:27.2060133Z ) 2025-05-07T20:32:27.3477225Z self = 2025-05-07T20:32:27.3477848Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:27.3478279Z 2025-05-07T20:32:27.3478395Z @given( 2025-05-07T20:32:27.3478658Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.3479021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.3479372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.3479751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.3480121Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.3480447Z ) 2025-05-07T20:32:27.3481065Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.3481571Z def test_silu_mul_quant( 2025-05-07T20:32:27.3481850Z self, 2025-05-07T20:32:27.3482077Z T: int, 2025-05-07T20:32:27.3482304Z D: int, 2025-05-07T20:32:27.3482556Z scale_ub: Optional[float], 2025-05-07T20:32:27.3482865Z contiguous: bool, 2025-05-07T20:32:27.3483130Z compiled: bool, 2025-05-07T20:32:27.3483386Z ) -> None: 2025-05-07T20:32:27.3483631Z torch.manual_seed(2025) 2025-05-07T20:32:27.3483899Z 2025-05-07T20:32:27.3484211Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.3484600Z 2025-05-07T20:32:27.3484820Z > x_sign = torch.sign(x) 2025-05-07T20:32:27.3487037Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:27.3489292Z 2025-05-07T20:32:27.3489427Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:27.3489674Z 2025-05-07T20:32:27.3489792Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.3490266Z self=, 2025-05-07T20:32:27.3490718Z T=1, 2025-05-07T20:32:27.3490929Z D=7168, 2025-05-07T20:32:27.3491150Z scale_ub=1200.0, 2025-05-07T20:32:27.3491398Z contiguous=True, 2025-05-07T20:32:27.3491651Z compiled=False, 2025-05-07T20:32:27.3491890Z ) 2025-05-07T20:32:27.3498325Z self = 2025-05-07T20:32:27.3498900Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:27.3499206Z 2025-05-07T20:32:27.3499297Z @given( 2025-05-07T20:32:27.3499562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.3499928Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.3500275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.3500646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.3501017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.3501339Z ) 2025-05-07T20:32:27.3501735Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.3502235Z def test_silu_mul_quant( 2025-05-07T20:32:27.3502515Z self, 2025-05-07T20:32:27.3502734Z T: int, 2025-05-07T20:32:27.3502957Z D: int, 2025-05-07T20:32:27.3503207Z scale_ub: Optional[float], 2025-05-07T20:32:27.3503515Z contiguous: bool, 2025-05-07T20:32:27.3503785Z compiled: bool, 2025-05-07T20:32:27.3504036Z ) -> None: 2025-05-07T20:32:27.3504280Z torch.manual_seed(2025) 2025-05-07T20:32:27.3504549Z 2025-05-07T20:32:27.3504861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.3505240Z 2025-05-07T20:32:27.3505462Z x_sign = torch.sign(x) 2025-05-07T20:32:27.3505790Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.3506141Z x = x_sign * x_clamp 2025-05-07T20:32:27.3506407Z x0 = x[:, :D] 2025-05-07T20:32:27.3506652Z x1 = x[:, D:] 2025-05-07T20:32:27.3506891Z 2025-05-07T20:32:27.3507095Z if contiguous: 2025-05-07T20:32:27.3507356Z x0 = x0.contiguous() 2025-05-07T20:32:27.3507648Z x1 = x1.contiguous() 2025-05-07T20:32:27.3507912Z 2025-05-07T20:32:27.3508132Z if scale_ub is not None: 2025-05-07T20:32:27.3508438Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.3508929Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.3509290Z ) 2025-05-07T20:32:27.3509517Z else: 2025-05-07T20:32:27.3509752Z scale_ub_tensor = None 2025-05-07T20:32:27.3510038Z 2025-05-07T20:32:27.3510307Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.3510655Z op = silu_mul_quant 2025-05-07T20:32:27.3510937Z if compiled: 2025-05-07T20:32:27.3511219Z op = torch.compile(op) 2025-05-07T20:32:27.3511554Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.3511857Z 2025-05-07T20:32:27.3512078Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.3512263Z 2025-05-07T20:32:27.3512382Z moe/activation_test.py:117: 2025-05-07T20:32:27.3512712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.3513143Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.3513676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.3514461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.3515232Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.3515956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.3516725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.3517466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.3518062Z kernel = self.compile( 2025-05-07T20:32:27.3518672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.3519410Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.3519853Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.3520115Z 2025-05-07T20:32:27.3520348Z self = 2025-05-07T20:32:27.3521553Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.3523095Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76762d7b50>} 2025-05-07T20:32:27.3524914Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.3526063Z context = 2025-05-07T20:32:27.3526387Z 2025-05-07T20:32:27.3526583Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.3527171Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.3527701Z module_map=module_map) 2025-05-07T20:32:27.3528117Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.3528512Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.3528804Z E ^ 2025-05-07T20:32:27.3529322Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.3529834Z 2025-05-07T20:32:27.3530300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.3530872Z 2025-05-07T20:32:27.3530994Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.3531460Z self=, 2025-05-07T20:32:27.3532056Z T=128, 2025-05-07T20:32:27.3532277Z D=5120, 2025-05-07T20:32:27.3532495Z scale_ub=None, 2025-05-07T20:32:27.3532741Z contiguous=True, 2025-05-07T20:32:27.3532994Z compiled=False, 2025-05-07T20:32:27.3533235Z ) 2025-05-07T20:32:27.4376408Z self = 2025-05-07T20:32:27.4377197Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:27.4377621Z 2025-05-07T20:32:27.4377744Z @given( 2025-05-07T20:32:27.4378271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.4379058Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.4379946Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.4380586Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.4381209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.4381744Z ) 2025-05-07T20:32:27.4382428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.4383274Z def test_silu_mul_quant( 2025-05-07T20:32:27.4383727Z self, 2025-05-07T20:32:27.4384102Z T: int, 2025-05-07T20:32:27.4384474Z D: int, 2025-05-07T20:32:27.4385200Z scale_ub: Optional[float], 2025-05-07T20:32:27.4385717Z contiguous: bool, 2025-05-07T20:32:27.4386172Z compiled: bool, 2025-05-07T20:32:27.4386595Z ) -> None: 2025-05-07T20:32:27.4387008Z torch.manual_seed(2025) 2025-05-07T20:32:27.4387468Z 2025-05-07T20:32:27.4387976Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.4388528Z 2025-05-07T20:32:27.4388749Z x_sign = torch.sign(x) 2025-05-07T20:32:27.4389071Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.4389412Z x = x_sign * x_clamp 2025-05-07T20:32:27.4389687Z x0 = x[:, :D] 2025-05-07T20:32:27.4389928Z x1 = x[:, D:] 2025-05-07T20:32:27.4390156Z 2025-05-07T20:32:27.4390367Z if contiguous: 2025-05-07T20:32:27.4390628Z x0 = x0.contiguous() 2025-05-07T20:32:27.4390914Z x1 = x1.contiguous() 2025-05-07T20:32:27.4391182Z 2025-05-07T20:32:27.4391401Z if scale_ub is not None: 2025-05-07T20:32:27.4391702Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.4392076Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.4392424Z ) 2025-05-07T20:32:27.4392632Z else: 2025-05-07T20:32:27.4392867Z scale_ub_tensor = None 2025-05-07T20:32:27.4393147Z 2025-05-07T20:32:27.4393400Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.4393815Z op = silu_mul_quant 2025-05-07T20:32:27.4394094Z if compiled: 2025-05-07T20:32:27.4394369Z op = torch.compile(op) 2025-05-07T20:32:27.4394694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.4394996Z 2025-05-07T20:32:27.4395216Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.4395401Z 2025-05-07T20:32:27.4395516Z moe/activation_test.py:117: 2025-05-07T20:32:27.4395847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.4396219Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.4396528Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.4397304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.4398080Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.4398679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.4399433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.4400171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.4400898Z kernel = self.compile( 2025-05-07T20:32:27.4401503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.4402236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.4402680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.4402933Z 2025-05-07T20:32:27.4403168Z self = 2025-05-07T20:32:27.4404367Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.4405907Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7676008670>} 2025-05-07T20:32:27.4407416Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.4408655Z context = 2025-05-07T20:32:27.4408974Z 2025-05-07T20:32:27.4409165Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.4409743Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.4410267Z module_map=module_map) 2025-05-07T20:32:27.4410675Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.4411066Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.4411355Z E ^ 2025-05-07T20:32:27.4411874Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.4412375Z 2025-05-07T20:32:27.4412848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.4413420Z 2025-05-07T20:32:27.4413536Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.4414004Z self=, 2025-05-07T20:32:27.4414451Z T=128, 2025-05-07T20:32:27.4414654Z D=7168, 2025-05-07T20:32:27.4414873Z scale_ub=None, 2025-05-07T20:32:27.4415112Z contiguous=True, 2025-05-07T20:32:27.4415356Z compiled=False, 2025-05-07T20:32:27.4415589Z ) 2025-05-07T20:32:27.4415946Z self = 2025-05-07T20:32:27.4416488Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:27.4416784Z 2025-05-07T20:32:27.4416871Z @given( 2025-05-07T20:32:27.4417127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.4417478Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.4417821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.4418224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.4418606Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.4418926Z ) 2025-05-07T20:32:27.4419318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.4419811Z def test_silu_mul_quant( 2025-05-07T20:32:27.4420182Z self, 2025-05-07T20:32:27.4420437Z T: int, 2025-05-07T20:32:27.4420658Z D: int, 2025-05-07T20:32:27.4420905Z scale_ub: Optional[float], 2025-05-07T20:32:27.4421204Z contiguous: bool, 2025-05-07T20:32:27.4421473Z compiled: bool, 2025-05-07T20:32:27.4421726Z ) -> None: 2025-05-07T20:32:27.4421960Z torch.manual_seed(2025) 2025-05-07T20:32:27.4422229Z 2025-05-07T20:32:27.4422533Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.4422909Z 2025-05-07T20:32:27.4423258Z x_sign = torch.sign(x) 2025-05-07T20:32:27.4423586Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.4424223Z x = x_sign * x_clamp 2025-05-07T20:32:27.4424495Z x0 = x[:, :D] 2025-05-07T20:32:27.4424745Z x1 = x[:, D:] 2025-05-07T20:32:27.4424981Z 2025-05-07T20:32:27.4425187Z if contiguous: 2025-05-07T20:32:27.4425454Z x0 = x0.contiguous() 2025-05-07T20:32:27.4425748Z x1 = x1.contiguous() 2025-05-07T20:32:27.4426012Z 2025-05-07T20:32:27.4426231Z if scale_ub is not None: 2025-05-07T20:32:27.4426546Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.4426918Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.4427266Z ) 2025-05-07T20:32:27.4427485Z else: 2025-05-07T20:32:27.4427716Z scale_ub_tensor = None 2025-05-07T20:32:27.4428006Z 2025-05-07T20:32:27.4428322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.4428675Z op = silu_mul_quant 2025-05-07T20:32:27.4428958Z if compiled: 2025-05-07T20:32:27.4429238Z op = torch.compile(op) 2025-05-07T20:32:27.4429748Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.4430062Z 2025-05-07T20:32:27.4430279Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.4430464Z 2025-05-07T20:32:27.4430581Z moe/activation_test.py:117: 2025-05-07T20:32:27.4431006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.4431573Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.4431893Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.4432669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.4433441Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.4434118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.4434884Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.4435623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.4436229Z kernel = self.compile( 2025-05-07T20:32:27.4436840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.4437575Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.4438022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.4438287Z 2025-05-07T20:32:27.4438517Z self = 2025-05-07T20:32:27.4439740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.4441275Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7676008ee0>} 2025-05-07T20:32:27.4442850Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.4443994Z context = 2025-05-07T20:32:27.4444320Z 2025-05-07T20:32:27.4444510Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.4445093Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.4445617Z module_map=module_map) 2025-05-07T20:32:27.4446187Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.4446586Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.4446876Z E ^ 2025-05-07T20:32:27.4447394Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False,
)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False,
)

    [@given/@settings decorators and signature identical to the example above]
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
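The CompilationError above is Triton refusing to lower the fp8e4nv (e4m3) dtype on this runner's GPU, which per the error only exposes fp8e4b15 and fp8e5. fp8e4nv lowering is generally assumed to need compute capability 8.9 or newer, so a guard along the following lines could skip these cases cleanly instead of erroring; the helper name and the skipIf wiring are illustrative assumptions, not the project's actual gating:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (e4m3) lowering needs SM 8.9+ (Ada/Hopper);
        # older parts, like the GPU in this job, expose only fp8e4b15/fp8e5.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...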
Hypothesis kept generating examples, and each one failed on an early CUDA allocation with GPU 0 already holding 22.03 GiB of its 22.07 GiB capacity (21.73 GiB allocated by PyTorch). The repeated @given source and OOM hint printed for every example are identical to the first occurrence above; only the parameters, the failing line, and the requested size vary:

Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=True,  compiled=False -> OutOfMemoryError: 40.00 MiB  (moe/activation_test.py:94, x_sign = torch.sign(x))
Trying example: T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=False -> OutOfMemoryError: 320.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=False -> OutOfMemoryError: 80.00 MiB  (moe/activation_test.py:92, torch.randn)
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False -> OutOfMemoryError: 40.00 MiB  (moe/activation_test.py:92, torch.randn)
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True  -> OutOfMemoryError: 112.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> OutOfMemoryError: 40.00 MiB  (moe/activation_test.py:92, torch.randn)
Trying example: T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> OutOfMemoryError: 112.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=True  -> OutOfMemoryError: 448.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=False -> OutOfMemoryError: 112.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=False -> OutOfMemoryError: 448.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> OutOfMemoryError: 448.00 MiB (moe/activation_test.py:92, torch.randn)
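Every OOM message carries the same allocator hint. The "reserved by PyTorch but unallocated" figure is small here (53.93 MiB at most), so fragmentation is unlikely to be the whole story, but the suggested setting is cheap to try. It has to be in the environment before the process first touches CUDA, for example:

    # In the shell that launches the tests:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
    #
    # Or from Python, set the variable before anything initializes CUDA:
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the setting so the allocator is sure to pick it up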
Trying example: T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> CompilationError: the allocations succeeded, but the _fbgemm_silu_mul_quant launch again failed to build with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") [traceback identical to the one shown above]
Trying example: T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=False -> OutOfMemoryError: 56.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=128,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True  -> CompilationError: with compiled=True the call enters torch._dynamo first (eval_frame.py:678: return fn(*args, **kwargs)) before reaching silu_mul_quant (activation.py:80), where the Triton build fails with the same fp8e4nv ValueError
Trying example: T=128,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> OutOfMemoryError: 20.00 MiB (moe/activation_test.py:95, x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)); GPU 0 is now down to 8.44 MiB free
Trying example: T=128,   D=5120, scale_ub=1200.0, contiguous=True,  compiled=True  -> OutOfMemoryError: 20.00 MiB (moe/activation_test.py:94, x_sign = torch.sign(x))
Trying example: T=128,   D=7168, scale_ub=None,   contiguous=True,  compiled=True  -> OutOfMemoryError: 20.00 MiB (moe/activation_test.py:92, torch.randn)
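By this point even 20 MiB requests for the smallest shape (T=128) fail, and 21.77 GiB was already held by PyTorch before this test allocated anything, so the pressure looks inherited from earlier tests in the same process rather than caused by test_silu_mul_quant itself. One speculative mitigation (an assumption about the harness, not a confirmed fix) is to hand cached blocks back between test methods:

    import gc
    import unittest

    import torch

    class ActivationTests(unittest.TestCase):
        def tearDown(self) -> None:
            # Drop dead references, then return cached blocks to the driver.
            # Note: tearDown runs once per test method, not once per Hypothesis
            # example, so tensors created inside an example still have to die
            # by going out of scope before this helps the next test.
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()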
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:28.5633331Z 2025-05-07T20:32:28.5633475Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:28.5633805Z 2025-05-07T20:32:28.5694054Z FAILED 2025-05-07T20:32:28.5694203Z 2025-05-07T20:32:28.5694407Z =================================== FAILURES =================================== 2025-05-07T20:32:28.5695012Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:28.5695562Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:28.5696406Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:28.5697115Z | yield 2025-05-07T20:32:28.5697895Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:32:28.5698648Z | self._callTestMethod(testMethod) 2025-05-07T20:32:28.5699413Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:32:28.5700248Z | method() 2025-05-07T20:32:28.5701239Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:28.5702393Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.5703406Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:28.5704384Z | raise the_error_hypothesis_found 2025-05-07T20:32:28.5705144Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:28.5705917Z +-+---------------- 1 ---------------- 2025-05-07T20:32:28.5706374Z | Traceback (most recent call last): 2025-05-07T20:32:28.5707489Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:28.5708706Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5711871Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:28.5715024Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:28.5715712Z | self=, 2025-05-07T20:32:28.5716366Z | T=128, 2025-05-07T20:32:28.5716689Z | D=7168, 2025-05-07T20:32:28.5717039Z | scale_ub=1200.0, 2025-05-07T20:32:28.5717416Z | contiguous=True, 2025-05-07T20:32:28.5717803Z | compiled=False, 2025-05-07T20:32:28.5718170Z | ) 2025-05-07T20:32:28.5718381Z | 2025-05-07T20:32:28.5718985Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:32:28.5719674Z +---------------- 2 ---------------- 2025-05-07T20:32:28.5720159Z | Traceback (most recent call last): 2025-05-07T20:32:28.5720976Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:28.5721860Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5724462Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:28.5726686Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:28.5727193Z | self=, 2025-05-07T20:32:28.5727664Z | T=128, 2025-05-07T20:32:28.5727901Z | D=7168, 2025-05-07T20:32:28.5728149Z | scale_ub=None, 2025-05-07T20:32:28.5728589Z | contiguous=True, 2025-05-07T20:32:28.5728874Z | compiled=True, 2025-05-07T20:32:28.5729136Z | ) 2025-05-07T20:32:28.5729340Z | 2025-05-07T20:32:28.5729943Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:28.5730634Z +---------------- 3 ---------------- 2025-05-07T20:32:28.5730963Z | Traceback (most recent call last): 2025-05-07T20:32:28.5731768Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:28.5732651Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5735335Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:28.5737547Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:28.5738045Z | self=, 2025-05-07T20:32:28.5738508Z | T=128, 2025-05-07T20:32:28.5738743Z | D=5120, 2025-05-07T20:32:28.5738984Z | scale_ub=1200.0, 2025-05-07T20:32:28.5739266Z | contiguous=True, 2025-05-07T20:32:28.5739619Z | compiled=True, 2025-05-07T20:32:28.5739881Z | ) 2025-05-07T20:32:28.5740105Z | 2025-05-07T20:32:28.5740844Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:28.5741731Z +---------------- 4 ---------------- 2025-05-07T20:32:28.5742129Z | Traceback (most recent call last): 2025-05-07T20:32:28.5743272Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:28.5744387Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:28.5745403Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:28.5746478Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.5747966Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:28.5749214Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.5750183Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:28.5751331Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.5767675Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:28.5768991Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.5770253Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:28.5771512Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.5772743Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:28.5773975Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.5774988Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:28.5775877Z | fn() 2025-05-07T20:32:28.5776766Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:28.5777692Z | self.fn.run( 2025-05-07T20:32:28.5778291Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:28.5778955Z | kernel = self.compile( 2025-05-07T20:32:28.5779657Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:28.5780455Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.5781254Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:28.5782265Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.5782859Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.5783355Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.5783797Z | ^ 2025-05-07T20:32:28.5784573Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.5785528Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:28.5786182Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:28.5786900Z | self=, 2025-05-07T20:32:28.5787595Z | T=1, # or any other generated value 2025-05-07T20:32:28.5788093Z | D=5120, # or any other generated value 2025-05-07T20:32:28.5788642Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:28.5789214Z | contiguous=True, # or any other generated value 2025-05-07T20:32:28.5789803Z | compiled=True, # or any other generated value 2025-05-07T20:32:28.5790287Z | ) 2025-05-07T20:32:28.5790587Z | 2025-05-07T20:32:28.5791434Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:28.5792391Z +------------------------------------ 2025-05-07T20:32:28.5792971Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:28.5793719Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.5794558Z self=, 2025-05-07T20:32:28.5795207Z T=1, 2025-05-07T20:32:28.5795502Z D=5120, 2025-05-07T20:32:28.5795820Z scale_ub=None, 2025-05-07T20:32:28.5796180Z contiguous=True, 2025-05-07T20:32:28.5796535Z compiled=True, 2025-05-07T20:32:28.5796876Z ) 2025-05-07T20:32:28.5797390Z self = 2025-05-07T20:32:28.5798151Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:28.5798559Z 2025-05-07T20:32:28.5798691Z @given( 2025-05-07T20:32:28.5799071Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.5799544Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.5800005Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.5800494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.5800965Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.5801406Z ) 2025-05-07T20:32:28.5801958Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.5802658Z def test_silu_mul_quant( 2025-05-07T20:32:28.5803062Z self, 2025-05-07T20:32:28.5803495Z T: int, 2025-05-07T20:32:28.5803831Z D: int, 2025-05-07T20:32:28.5804198Z scale_ub: Optional[float], 2025-05-07T20:32:28.5804629Z contiguous: bool, 2025-05-07T20:32:28.5804994Z compiled: bool, 2025-05-07T20:32:28.5805337Z ) -> None: 2025-05-07T20:32:28.5805659Z torch.manual_seed(2025) 2025-05-07T20:32:28.5806014Z 2025-05-07T20:32:28.5806413Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5806924Z 2025-05-07T20:32:28.5807224Z x_sign = torch.sign(x) 2025-05-07T20:32:28.5807665Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.5808123Z x = x_sign * x_clamp 2025-05-07T20:32:28.5808491Z x0 = x[:, :D] 2025-05-07T20:32:28.5808836Z x1 = x[:, D:] 2025-05-07T20:32:28.5809154Z 2025-05-07T20:32:28.5809447Z if contiguous: 2025-05-07T20:32:28.5809806Z x0 = x0.contiguous() 
2025-05-07T20:32:28.5810203Z x1 = x1.contiguous() 2025-05-07T20:32:28.5810569Z 2025-05-07T20:32:28.5810870Z if scale_ub is not None: 2025-05-07T20:32:28.5811284Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.5811776Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.5812235Z ) 2025-05-07T20:32:28.5812528Z else: 2025-05-07T20:32:28.5812844Z scale_ub_tensor = None 2025-05-07T20:32:28.5813229Z 2025-05-07T20:32:28.5813581Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5814045Z op = silu_mul_quant 2025-05-07T20:32:28.5814423Z if compiled: 2025-05-07T20:32:28.5814807Z op = torch.compile(op) 2025-05-07T20:32:28.5815252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5815668Z 2025-05-07T20:32:28.5815972Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.5816427Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.5816906Z 2025-05-07T20:32:28.5817293Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5817832Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.5818299Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.5818805Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.5819378Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.5819872Z 2025-05-07T20:32:28.5820204Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:28.5820519Z 2025-05-07T20:32:28.5820687Z moe/activation_test.py:126: 2025-05-07T20:32:28.5821160Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5821697Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.5822340Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.5823582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.5825144Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.5826021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.5827093Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.5828146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.5829360Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.5830551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:28.5831757Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.5832915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.5834393Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.5835331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.5836160Z fn() 2025-05-07T20:32:28.5836973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.5837903Z self.fn.run( 2025-05-07T20:32:28.5838645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.5839468Z kernel = self.compile( 2025-05-07T20:32:28.5840324Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.5841351Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.5841949Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5842283Z 2025-05-07T20:32:28.5842572Z self = 2025-05-07T20:32:28.5844112Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.5846158Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a4c60550>} 2025-05-07T20:32:28.5848252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.5849874Z context = 2025-05-07T20:32:28.5850332Z 2025-05-07T20:32:28.5850601Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.5851447Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.5852197Z module_map=module_map) 2025-05-07T20:32:28.5852777Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.5853354Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.5853790Z E ^ 2025-05-07T20:32:28.5854535Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.5855253Z 2025-05-07T20:32:28.5855909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.5856896Z 2025-05-07T20:32:28.5857067Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.5857717Z self=, 2025-05-07T20:32:28.5858338Z T=2048, 2025-05-07T20:32:28.5858648Z D=5120, 2025-05-07T20:32:28.5858957Z scale_ub=1200.0, 2025-05-07T20:32:28.5859311Z contiguous=True, 2025-05-07T20:32:28.5859674Z compiled=False, 2025-05-07T20:32:28.5860008Z ) 2025-05-07T20:32:28.5860513Z self = 2025-05-07T20:32:28.5861281Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:28.5861715Z 2025-05-07T20:32:28.5861840Z @given( 2025-05-07T20:32:28.5862210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.5862698Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.5863182Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.5863699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.5864188Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.5864627Z ) 2025-05-07T20:32:28.5865193Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.5865961Z def test_silu_mul_quant( 2025-05-07T20:32:28.5866350Z self, 2025-05-07T20:32:28.5866681Z T: int, 2025-05-07T20:32:28.5867008Z D: int, 2025-05-07T20:32:28.5867319Z scale_ub: Optional[float], 2025-05-07T20:32:28.5867751Z contiguous: bool, 2025-05-07T20:32:28.5868148Z compiled: bool, 2025-05-07T20:32:28.5868526Z ) -> None: 2025-05-07T20:32:28.5868886Z torch.manual_seed(2025) 2025-05-07T20:32:28.5869239Z 2025-05-07T20:32:28.5869670Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5870221Z 2025-05-07T20:32:28.5870541Z x_sign = torch.sign(x) 2025-05-07T20:32:28.5870998Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.5871493Z x = x_sign * x_clamp 2025-05-07T20:32:28.5871887Z x0 = x[:, :D] 
2025-05-07T20:32:28.5872197Z x1 = x[:, D:] 2025-05-07T20:32:28.5872507Z 2025-05-07T20:32:28.5872814Z if contiguous: 2025-05-07T20:32:28.5873187Z x0 = x0.contiguous() 2025-05-07T20:32:28.5873728Z x1 = x1.contiguous() 2025-05-07T20:32:28.5874080Z 2025-05-07T20:32:28.5874358Z if scale_ub is not None: 2025-05-07T20:32:28.5874738Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.5875203Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.5875641Z ) 2025-05-07T20:32:28.5875915Z else: 2025-05-07T20:32:28.5876279Z scale_ub_tensor = None 2025-05-07T20:32:28.5876671Z 2025-05-07T20:32:28.5876999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5877451Z op = silu_mul_quant 2025-05-07T20:32:28.5877813Z if compiled: 2025-05-07T20:32:28.5878175Z op = torch.compile(op) 2025-05-07T20:32:28.5878677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5879105Z 2025-05-07T20:32:28.5879378Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.5879635Z 2025-05-07T20:32:28.5879780Z moe/activation_test.py:117: 2025-05-07T20:32:28.5880203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5880684Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.5881092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5882190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.5883298Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.5884133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.5885895Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.5886837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.5887604Z kernel = self.compile( 2025-05-07T20:32:28.5888379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.5889345Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.5889930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5890258Z 2025-05-07T20:32:28.5890576Z self = 2025-05-07T20:32:28.5892112Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.5894093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a53d89d0>} 2025-05-07T20:32:28.5896039Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.5897695Z context = 2025-05-07T20:32:28.5898135Z 2025-05-07T20:32:28.5898372Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.5899175Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.5899901Z module_map=module_map) 2025-05-07T20:32:28.5900464Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.5901002Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.5901419Z E ^ 2025-05-07T20:32:28.5902139Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.5902825Z 2025-05-07T20:32:28.5903456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.5904254Z 2025-05-07T20:32:28.5904418Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.5905065Z self=, 2025-05-07T20:32:28.5905698Z T=2048, 2025-05-07T20:32:28.5906001Z D=5120, 2025-05-07T20:32:28.5906316Z scale_ub=1200.0, 2025-05-07T20:32:28.5906677Z contiguous=True, 2025-05-07T20:32:28.5907032Z compiled=True, 2025-05-07T20:32:28.5907373Z ) 2025-05-07T20:32:28.5907882Z self = 2025-05-07T20:32:28.5908646Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:28.5909072Z 2025-05-07T20:32:28.5909199Z @given( 2025-05-07T20:32:28.5909569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.5910055Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.5910556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.5911090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.5911629Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.5912098Z ) 2025-05-07T20:32:28.5912665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.5913374Z def test_silu_mul_quant( 2025-05-07T20:32:28.5913891Z self, 2025-05-07T20:32:28.5914216Z T: int, 2025-05-07T20:32:28.5914546Z D: int, 2025-05-07T20:32:28.5914899Z scale_ub: Optional[float], 2025-05-07T20:32:28.5915336Z contiguous: bool, 2025-05-07T20:32:28.5915707Z compiled: bool, 2025-05-07T20:32:28.5916051Z ) -> None: 2025-05-07T20:32:28.5916508Z torch.manual_seed(2025) 2025-05-07T20:32:28.5916905Z 2025-05-07T20:32:28.5917334Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5917891Z 2025-05-07T20:32:28.5918219Z x_sign = torch.sign(x) 2025-05-07T20:32:28.5918689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.5919169Z x = x_sign * x_clamp 2025-05-07T20:32:28.5919557Z x0 = x[:, :D] 2025-05-07T20:32:28.5919905Z x1 = x[:, D:] 2025-05-07T20:32:28.5920230Z 2025-05-07T20:32:28.5920525Z if contiguous: 2025-05-07T20:32:28.5920889Z x0 = x0.contiguous() 2025-05-07T20:32:28.5921294Z x1 = x1.contiguous() 2025-05-07T20:32:28.5921686Z 2025-05-07T20:32:28.5922001Z if scale_ub is not None: 2025-05-07T20:32:28.5922440Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.5922974Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.5923483Z ) 2025-05-07T20:32:28.5924080Z else: 2025-05-07T20:32:28.5924440Z scale_ub_tensor = None 2025-05-07T20:32:28.5924853Z 2025-05-07T20:32:28.5925219Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5925943Z op = silu_mul_quant 2025-05-07T20:32:28.5926354Z if compiled: 2025-05-07T20:32:28.5926761Z op = torch.compile(op) 2025-05-07T20:32:28.5927238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5927686Z 2025-05-07T20:32:28.5928006Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.5928465Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.5928936Z 2025-05-07T20:32:28.5929315Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5929835Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.5930304Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.5930810Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.5931391Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.5931886Z 2025-05-07T20:32:28.5932217Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:28.5932514Z 2025-05-07T20:32:28.5932671Z moe/activation_test.py:126: 2025-05-07T20:32:28.5933108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5933625Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.5934145Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.5935348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.5936515Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.5937377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.5938452Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.5939468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.5940530Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.5941665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:28.5942817Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.5943923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.5944898Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.5945813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.5946603Z fn() 2025-05-07T20:32:28.5947557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.5948465Z self.fn.run( 2025-05-07T20:32:28.5949192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.5950028Z kernel = self.compile( 2025-05-07T20:32:28.5950847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.5951823Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.5952425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5952779Z 2025-05-07T20:32:28.5953085Z self = 2025-05-07T20:32:28.5954809Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.5956886Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f76a536f1c0>} 2025-05-07T20:32:28.5959043Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.5960534Z context = 2025-05-07T20:32:28.5960960Z 2025-05-07T20:32:28.5961210Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.5962013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.5962686Z module_map=module_map) 2025-05-07T20:32:28.5963196Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.5963719Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.5964112Z E ^ 2025-05-07T20:32:28.5964763Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.5965417Z 2025-05-07T20:32:28.5966014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.5966748Z 2025-05-07T20:32:28.5966897Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.5967487Z self=, 2025-05-07T20:32:28.5968064Z T=16384, 2025-05-07T20:32:28.5968346Z D=7168, 2025-05-07T20:32:28.5968629Z scale_ub=1200.0, 2025-05-07T20:32:28.5968942Z contiguous=False, 2025-05-07T20:32:28.5969286Z compiled=False, 2025-05-07T20:32:28.5969609Z ) 2025-05-07T20:32:28.5970095Z self = 2025-05-07T20:32:28.5970861Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:28.5971299Z 2025-05-07T20:32:28.5971431Z @given( 2025-05-07T20:32:28.5971761Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.5972229Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.5972667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.5973156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.5973643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.5974081Z ) 2025-05-07T20:32:28.5974595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.5975298Z def test_silu_mul_quant( 2025-05-07T20:32:28.5975701Z self, 2025-05-07T20:32:28.5976028Z T: int, 2025-05-07T20:32:28.5976357Z D: int, 2025-05-07T20:32:28.5976723Z scale_ub: Optional[float], 2025-05-07T20:32:28.5977274Z contiguous: bool, 2025-05-07T20:32:28.5977621Z compiled: bool, 2025-05-07T20:32:28.5977965Z ) -> None: 2025-05-07T20:32:28.5978291Z torch.manual_seed(2025) 2025-05-07T20:32:28.5978631Z 2025-05-07T20:32:28.5979030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5979567Z 2025-05-07T20:32:28.5979848Z x_sign = torch.sign(x) 2025-05-07T20:32:28.5980284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.5980763Z x = x_sign * x_clamp 2025-05-07T20:32:28.5981141Z x0 = x[:, :D] 2025-05-07T20:32:28.5981478Z x1 = x[:, D:] 2025-05-07T20:32:28.5981819Z 2025-05-07T20:32:28.5982113Z if contiguous: 2025-05-07T20:32:28.5982493Z x0 = x0.contiguous() 2025-05-07T20:32:28.5982917Z x1 = x1.contiguous() 2025-05-07T20:32:28.5983309Z 2025-05-07T20:32:28.5983614Z if scale_ub is not None: 2025-05-07T20:32:28.5984061Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.5984596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.5985062Z ) 2025-05-07T20:32:28.5985373Z else: 2025-05-07T20:32:28.5985706Z scale_ub_tensor = None 2025-05-07T20:32:28.5986250Z 2025-05-07T20:32:28.5986606Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5987084Z op = silu_mul_quant 2025-05-07T20:32:28.5987483Z if compiled: 
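# [editor's note, not part of the original log] torch.compile only builds a
# lazily-compiled wrapper at this line; no Triton or Inductor work happens
# until the wrapper's first call inside fn(). That is why compiled=True and
# compiled=False examples surface the same Triton CompilationError: the
# failure occurs during kernel compilation at call time, not here.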
2025-05-07T20:32:28.5987880Z op = torch.compile(op) 2025-05-07T20:32:28.5988360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5988844Z 2025-05-07T20:32:28.5989158Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.5989435Z 2025-05-07T20:32:28.5989598Z moe/activation_test.py:117: 2025-05-07T20:32:28.6001980Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6002503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6002951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6003990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6005020Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6005854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6006929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6007986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6008838Z kernel = self.compile( 2025-05-07T20:32:28.6009696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6010669Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6011276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6011628Z 2025-05-07T20:32:28.6011942Z self = 2025-05-07T20:32:28.6013562Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6015625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a4cb7ac0>} 2025-05-07T20:32:28.6017667Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6019225Z context = 2025-05-07T20:32:28.6019662Z 2025-05-07T20:32:28.6020081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6020903Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6021650Z module_map=module_map) 2025-05-07T20:32:28.6022222Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6022779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6023182Z E ^ 2025-05-07T20:32:28.6024209Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6024932Z 2025-05-07T20:32:28.6025607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6026402Z 2025-05-07T20:32:28.6026584Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6027265Z self=, 2025-05-07T20:32:28.6027910Z T=1, 2025-05-07T20:32:28.6028219Z D=7168, 2025-05-07T20:32:28.6028528Z scale_ub=None, 2025-05-07T20:32:28.6028889Z contiguous=True, 2025-05-07T20:32:28.6029263Z compiled=True, 2025-05-07T20:32:28.6029797Z ) 2025-05-07T20:32:28.6030300Z self = 2025-05-07T20:32:28.6031055Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:28.6031459Z 2025-05-07T20:32:28.6031588Z @given( 2025-05-07T20:32:28.6031969Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6032462Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6032953Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6033478Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6034126Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6034571Z ) 2025-05-07T20:32:28.6035112Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6035794Z def test_silu_mul_quant( 2025-05-07T20:32:28.6036173Z self, 2025-05-07T20:32:28.6036480Z T: int, 2025-05-07T20:32:28.6036799Z D: int, 2025-05-07T20:32:28.6037161Z scale_ub: Optional[float], 2025-05-07T20:32:28.6037598Z contiguous: bool, 2025-05-07T20:32:28.6037969Z compiled: bool, 2025-05-07T20:32:28.6038348Z ) -> None: 2025-05-07T20:32:28.6038702Z torch.manual_seed(2025) 2025-05-07T20:32:28.6039092Z 2025-05-07T20:32:28.6039532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6040087Z 2025-05-07T20:32:28.6040401Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6040878Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6041381Z x = x_sign * x_clamp 2025-05-07T20:32:28.6041749Z x0 = x[:, :D] 2025-05-07T20:32:28.6042104Z x1 = x[:, D:] 2025-05-07T20:32:28.6042439Z 2025-05-07T20:32:28.6042740Z if contiguous: 2025-05-07T20:32:28.6043119Z x0 = x0.contiguous() 2025-05-07T20:32:28.6043543Z x1 = x1.contiguous() 2025-05-07T20:32:28.6043931Z 2025-05-07T20:32:28.6044259Z if scale_ub is not None: 2025-05-07T20:32:28.6044704Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6045237Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6045722Z ) 2025-05-07T20:32:28.6046044Z else: 2025-05-07T20:32:28.6046383Z scale_ub_tensor = None 2025-05-07T20:32:28.6046774Z 2025-05-07T20:32:28.6047138Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6047631Z op = silu_mul_quant 2025-05-07T20:32:28.6048020Z if compiled: 2025-05-07T20:32:28.6048413Z op = torch.compile(op) 2025-05-07T20:32:28.6048874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6049293Z 2025-05-07T20:32:28.6049779Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.6050193Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.6050603Z 2025-05-07T20:32:28.6050956Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6051494Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.6051967Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.6052455Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.6053025Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.6053533Z 2025-05-07T20:32:28.6053861Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:28.6054181Z 2025-05-07T20:32:28.6054346Z moe/activation_test.py:126: 2025-05-07T20:32:28.6054840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6055313Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.6055784Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.6056902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.6057961Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.6058905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6059876Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6060850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.6061871Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.6062931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:28.6064001Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.6065035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.6066012Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.6067014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.6067885Z fn() 2025-05-07T20:32:28.6068763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.6069681Z self.fn.run( 2025-05-07T20:32:28.6070436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6071275Z kernel = self.compile( 2025-05-07T20:32:28.6072112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6073150Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6073875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6074226Z 2025-05-07T20:32:28.6074545Z self = 2025-05-07T20:32:28.6076232Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6078431Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f76943d30a0>} 2025-05-07T20:32:28.6080390Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6081680Z context = 2025-05-07T20:32:28.6082013Z 2025-05-07T20:32:28.6082222Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6082816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6083357Z module_map=module_map) 2025-05-07T20:32:28.6083780Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6084183Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.6084493Z E ^ 2025-05-07T20:32:28.6085030Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6085549Z 2025-05-07T20:32:28.6086033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6086617Z 2025-05-07T20:32:28.6086738Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6087225Z self=, 2025-05-07T20:32:28.6087686Z T=4096, 2025-05-07T20:32:28.6087902Z D=5120, 2025-05-07T20:32:28.6088129Z scale_ub=None, 2025-05-07T20:32:28.6088474Z contiguous=False, 2025-05-07T20:32:28.6088740Z compiled=False, 2025-05-07T20:32:28.6088972Z ) 2025-05-07T20:32:28.6089340Z self = 2025-05-07T20:32:28.6089908Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.6090219Z 2025-05-07T20:32:28.6090313Z @given( 2025-05-07T20:32:28.6090583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6090948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6091298Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6091679Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6092066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6092392Z ) 2025-05-07T20:32:28.6092803Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6093314Z def test_silu_mul_quant( 2025-05-07T20:32:28.6093608Z self, 2025-05-07T20:32:28.6093834Z T: int, 2025-05-07T20:32:28.6094070Z D: int, 2025-05-07T20:32:28.6094326Z scale_ub: Optional[float], 2025-05-07T20:32:28.6094634Z contiguous: bool, 2025-05-07T20:32:28.6094913Z compiled: bool, 2025-05-07T20:32:28.6095174Z ) -> None: 2025-05-07T20:32:28.6095422Z torch.manual_seed(2025) 2025-05-07T20:32:28.6095716Z 2025-05-07T20:32:28.6096033Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6096421Z 2025-05-07T20:32:28.6096651Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6096993Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6097344Z x = x_sign * x_clamp 2025-05-07T20:32:28.6097627Z x0 = x[:, :D] 2025-05-07T20:32:28.6097879Z x1 = x[:, D:] 2025-05-07T20:32:28.6098116Z 2025-05-07T20:32:28.6098361Z if contiguous: 2025-05-07T20:32:28.6098658Z x0 = x0.contiguous() 2025-05-07T20:32:28.6098973Z x1 = x1.contiguous() 2025-05-07T20:32:28.6099248Z 2025-05-07T20:32:28.6099475Z if scale_ub is not None: 2025-05-07T20:32:28.6099792Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6100170Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6100526Z ) 2025-05-07T20:32:28.6100751Z else: 2025-05-07T20:32:28.6100991Z scale_ub_tensor = None 2025-05-07T20:32:28.6101284Z 2025-05-07T20:32:28.6101554Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6101910Z op = silu_mul_quant 2025-05-07T20:32:28.6102201Z if compiled: 
2025-05-07T20:32:28.6102492Z op = torch.compile(op) 2025-05-07T20:32:28.6102921Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6103245Z 2025-05-07T20:32:28.6103472Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6103660Z 2025-05-07T20:32:28.6103778Z moe/activation_test.py:117: 2025-05-07T20:32:28.6104132Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6104534Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6104866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6105649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6106437Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6107063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6107841Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6108650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6109307Z kernel = self.compile( 2025-05-07T20:32:28.6109934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6110803Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6111263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6111524Z 2025-05-07T20:32:28.6111769Z self = 2025-05-07T20:32:28.6113013Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6114669Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a7370c10>} 2025-05-07T20:32:28.6116209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6117389Z context = 2025-05-07T20:32:28.6117720Z 2025-05-07T20:32:28.6117921Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6118515Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6119058Z module_map=module_map) 2025-05-07T20:32:28.6119479Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6119888Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6120184Z E ^ 2025-05-07T20:32:28.6120724Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6121241Z 2025-05-07T20:32:28.6121723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6122317Z 2025-05-07T20:32:28.6122445Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6122917Z self=, 2025-05-07T20:32:28.6123378Z T=4096, 2025-05-07T20:32:28.6123601Z D=7168, 2025-05-07T20:32:28.6124160Z scale_ub=None, 2025-05-07T20:32:28.6124427Z contiguous=False, 2025-05-07T20:32:28.6124698Z compiled=False, 2025-05-07T20:32:28.6124929Z ) 2025-05-07T20:32:28.6125299Z self = 2025-05-07T20:32:28.6125873Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.6126188Z 2025-05-07T20:32:28.6126280Z @given( 2025-05-07T20:32:28.6126714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6127083Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6127443Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6127829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6128212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6128553Z ) 2025-05-07T20:32:28.6128974Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6129480Z def test_silu_mul_quant( 2025-05-07T20:32:28.6129763Z self, 2025-05-07T20:32:28.6129986Z T: int, 2025-05-07T20:32:28.6130225Z D: int, 2025-05-07T20:32:28.6130480Z scale_ub: Optional[float], 2025-05-07T20:32:28.6130794Z contiguous: bool, 2025-05-07T20:32:28.6131077Z compiled: bool, 2025-05-07T20:32:28.6131337Z ) -> None: 2025-05-07T20:32:28.6131587Z torch.manual_seed(2025) 2025-05-07T20:32:28.6131873Z 2025-05-07T20:32:28.6132195Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6132583Z 2025-05-07T20:32:28.6132814Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6133153Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6133650Z x = x_sign * x_clamp 2025-05-07T20:32:28.6133926Z x0 = x[:, :D] 2025-05-07T20:32:28.6134183Z x1 = x[:, D:] 2025-05-07T20:32:28.6134436Z 2025-05-07T20:32:28.6134653Z if contiguous: 2025-05-07T20:32:28.6134927Z x0 = x0.contiguous() 2025-05-07T20:32:28.6135230Z x1 = x1.contiguous() 2025-05-07T20:32:28.6135506Z 2025-05-07T20:32:28.6135734Z if scale_ub is not None: 2025-05-07T20:32:28.6136051Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6136434Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6136792Z ) 2025-05-07T20:32:28.6137020Z else: 2025-05-07T20:32:28.6137267Z scale_ub_tensor = None 2025-05-07T20:32:28.6137559Z 2025-05-07T20:32:28.6137830Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6138191Z op = silu_mul_quant 2025-05-07T20:32:28.6138494Z if compiled: 2025-05-07T20:32:28.6138784Z op = torch.compile(op) 2025-05-07T20:32:28.6139129Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6139441Z 2025-05-07T20:32:28.6139670Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6139859Z 2025-05-07T20:32:28.6139980Z moe/activation_test.py:117: 2025-05-07T20:32:28.6140318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6140703Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6141031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6141816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6142608Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6143226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6144007Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6144768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6145380Z kernel = self.compile( 2025-05-07T20:32:28.6146004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6146757Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6147211Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6147480Z 2025-05-07T20:32:28.6147718Z self = 2025-05-07T20:32:28.6149046Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6150607Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a5ea7520>} 2025-05-07T20:32:28.6152123Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6153286Z context = 2025-05-07T20:32:28.6153700Z 2025-05-07T20:32:28.6153897Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6154502Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6155042Z module_map=module_map) 2025-05-07T20:32:28.6155462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6155872Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6156259Z E ^ 2025-05-07T20:32:28.6156792Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6157314Z 2025-05-07T20:32:28.6157785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6158393Z 2025-05-07T20:32:28.6158527Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6159050Z self=, 2025-05-07T20:32:28.6159519Z T=128, 2025-05-07T20:32:28.6159743Z D=7168, 2025-05-07T20:32:28.6159966Z scale_ub=None, 2025-05-07T20:32:28.6160222Z contiguous=False, 2025-05-07T20:32:28.6160490Z compiled=True, 2025-05-07T20:32:28.6160729Z ) 2025-05-07T20:32:28.6161098Z self = 2025-05-07T20:32:28.6161661Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.6161974Z 2025-05-07T20:32:28.6162071Z @given( 2025-05-07T20:32:28.6162339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6162702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6163060Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6163439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6163824Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6164157Z ) 2025-05-07T20:32:28.6164558Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6165068Z def test_silu_mul_quant( 2025-05-07T20:32:28.6165352Z self, 2025-05-07T20:32:28.6165581Z T: int, 2025-05-07T20:32:28.6165814Z D: int, 2025-05-07T20:32:28.6166075Z scale_ub: Optional[float], 2025-05-07T20:32:28.6166391Z contiguous: bool, 2025-05-07T20:32:28.6166671Z compiled: bool, 2025-05-07T20:32:28.6166934Z ) -> None: 2025-05-07T20:32:28.6167195Z torch.manual_seed(2025) 2025-05-07T20:32:28.6167472Z 2025-05-07T20:32:28.6167793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6168186Z 2025-05-07T20:32:28.6168409Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6168749Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6169109Z x = x_sign * x_clamp 2025-05-07T20:32:28.6169387Z x0 = x[:, :D] 2025-05-07T20:32:28.6169643Z x1 = x[:, D:] 2025-05-07T20:32:28.6169889Z 2025-05-07T20:32:28.6170107Z if contiguous: 2025-05-07T20:32:28.6170382Z x0 = x0.contiguous() 2025-05-07T20:32:28.6170686Z x1 = x1.contiguous() 2025-05-07T20:32:28.6170966Z 2025-05-07T20:32:28.6171288Z if scale_ub is not None: 2025-05-07T20:32:28.6171615Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6172010Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6172372Z ) 2025-05-07T20:32:28.6172608Z else: 2025-05-07T20:32:28.6172862Z scale_ub_tensor = None 2025-05-07T20:32:28.6173152Z 2025-05-07T20:32:28.6173429Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6173795Z op = silu_mul_quant 2025-05-07T20:32:28.6174087Z if compiled: 2025-05-07T20:32:28.6174380Z op = torch.compile(op) 2025-05-07T20:32:28.6174732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6175053Z 2025-05-07T20:32:28.6175286Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.6175630Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.6175966Z 2025-05-07T20:32:28.6176255Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6176645Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.6176985Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.6177346Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.6177854Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.6178214Z 2025-05-07T20:32:28.6178448Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:28.6178682Z 2025-05-07T20:32:28.6178799Z moe/activation_test.py:126: 2025-05-07T20:32:28.6179149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6179536Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.6179920Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.6180825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.6181701Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.6182326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6183114Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6183916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.6184748Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.6185610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:28.6186471Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.6187310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.6188049Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.6188742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.6189344Z fn() 2025-05-07T20:32:28.6189945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.6190610Z self.fn.run( 2025-05-07T20:32:28.6191151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6191765Z kernel = self.compile( 2025-05-07T20:32:28.6192386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6193143Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6193675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6193939Z 2025-05-07T20:32:28.6194305Z self = 2025-05-07T20:32:28.6195538Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6197110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f76a5ea7370>} 2025-05-07T20:32:28.6198640Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6199810Z context = 2025-05-07T20:32:28.6200141Z 2025-05-07T20:32:28.6200340Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6200941Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6208496Z module_map=module_map) 2025-05-07T20:32:28.6208941Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6209477Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.6209794Z E ^ 2025-05-07T20:32:28.6210335Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6210856Z 2025-05-07T20:32:28.6211342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6211941Z 2025-05-07T20:32:28.6212065Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6212546Z self=, 2025-05-07T20:32:28.6213009Z T=128, 2025-05-07T20:32:28.6213226Z D=7168, 2025-05-07T20:32:28.6213463Z scale_ub=None, 2025-05-07T20:32:28.6213722Z contiguous=False, 2025-05-07T20:32:28.6213985Z compiled=False, 2025-05-07T20:32:28.6214232Z ) 2025-05-07T20:32:28.6214604Z self = 2025-05-07T20:32:28.6215174Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.6215492Z 2025-05-07T20:32:28.6215587Z @given( 2025-05-07T20:32:28.6215863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6216232Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6216590Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6216979Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6217368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6217700Z ) 2025-05-07T20:32:28.6218110Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6218631Z def test_silu_mul_quant( 2025-05-07T20:32:28.6218921Z self, 2025-05-07T20:32:28.6219154Z T: int, 2025-05-07T20:32:28.6219393Z D: int, 2025-05-07T20:32:28.6219649Z scale_ub: Optional[float], 2025-05-07T20:32:28.6219978Z contiguous: bool, 2025-05-07T20:32:28.6220265Z compiled: bool, 2025-05-07T20:32:28.6220525Z ) -> None: 2025-05-07T20:32:28.6220781Z torch.manual_seed(2025) 2025-05-07T20:32:28.6221066Z 2025-05-07T20:32:28.6221381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6221777Z 2025-05-07T20:32:28.6222008Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6222342Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6222704Z x = x_sign * x_clamp 2025-05-07T20:32:28.6222991Z x0 = x[:, :D] 2025-05-07T20:32:28.6223241Z x1 = x[:, D:] 2025-05-07T20:32:28.6223483Z 2025-05-07T20:32:28.6223701Z if contiguous: 2025-05-07T20:32:28.6224427Z x0 = x0.contiguous() 2025-05-07T20:32:28.6224738Z x1 = x1.contiguous() 2025-05-07T20:32:28.6225018Z 2025-05-07T20:32:28.6225244Z if scale_ub is not None: 2025-05-07T20:32:28.6225553Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6225947Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6226307Z ) 2025-05-07T20:32:28.6226528Z else: 2025-05-07T20:32:28.6226777Z scale_ub_tensor = None 2025-05-07T20:32:28.6227069Z 2025-05-07T20:32:28.6227337Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6227702Z op = silu_mul_quant 2025-05-07T20:32:28.6227995Z if compiled: 
2025-05-07T20:32:28.6228278Z op = torch.compile(op) 2025-05-07T20:32:28.6228624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6228945Z 2025-05-07T20:32:28.6229166Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6229362Z 2025-05-07T20:32:28.6229483Z moe/activation_test.py:117: 2025-05-07T20:32:28.6229824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6230209Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6230532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6231468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6232261Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6232874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6233720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6234481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6235092Z kernel = self.compile( 2025-05-07T20:32:28.6235718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6236472Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6236928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6237199Z 2025-05-07T20:32:28.6237443Z self = 2025-05-07T20:32:28.6238669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6240227Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7694293640>} 2025-05-07T20:32:28.6241751Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6242919Z context = 2025-05-07T20:32:28.6243249Z 2025-05-07T20:32:28.6243447Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6244047Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6244585Z module_map=module_map) 2025-05-07T20:32:28.6245006Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6245408Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6245709Z E ^ 2025-05-07T20:32:28.6246242Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis then tries further examples; each one re-prints the identical test body and fails with the same CompilationError, so only the drawn parameters and the failing call site differ:

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> fn() at moe/activation_test.py:117: CompilationError compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() at moe/activation_test.py:126: CompilationError compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() at moe/activation_test.py:126: CompilationError compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() at moe/activation_test.py:126: CompilationError compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() at moe/activation_test.py:126: CompilationError compiling _kernel_quantize_fp8_row
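Each "Trying example" block is Hypothesis running in verbose mode: @given draws one value per parameter from its st.sampled_from strategy, and @settings(verbosity=Verbosity.verbose) prints the drawn example before executing the test body. A self-contained toy sketch of the same mechanism (not from this suite):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048]),
        contiguous=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=6, deadline=None)
    def test_demo(T: int, contiguous: bool) -> None:
        # Verbose mode logs each invocation as "Trying example:
        # test_demo(...)", exactly like the lines above.
        assert T > 0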
Further examples fail identically:

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() at moe/activation_test.py:126: CompilationError compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    -> fn() at moe/activation_test.py:117 (via torch/_dynamo/eval_frame.py:678 in _fn): CompilationError compiling _fbgemm_silu_mul_quant
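The failure is independent of FBGEMM's kernels: any Triton kernel that casts to tl.float8e4nv fails the same way during ast_to_ttir on this GPU. A minimal repro sketch (kernel and buffer names are illustrative):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On GPUs below SM 8.9 this cast is what raises
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)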
And again:

Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
    -> ref_fn() at moe/activation_test.py:126: CompilationError compiling _kernel_quantize_fp8_row
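For reference, the property the test checks after a successful run is y ≈ y_fp8.to(torch.float32) * y_scale[:, None], i.e. row-wise max-abs scaling into the FP8 E4M3 range, optionally clamped by scale_ub. A pure-PyTorch sketch of those semantics (an assumption-laden stand-in, not FBGEMM's triton_quantize_fp8_row):

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value in torch.float8_e4m3fn

    def rowwise_quantize_fp8(y, scale_ub=None):
        # Scale each row by its max-abs value so it fits the fp8 range;
        # dequantizing with y_fp8.to(torch.float32) * y_scale[:, None]
        # approximately recovers y, matching test_silu_mul_quant's check.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = (row_max / FP8_E4M3_MAX).clamp(min=1e-12)
        y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale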
y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x…>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function …>, 'min_dot_size': <function … at 0x7f7677ea89d0>}
module_map = {'triton.language.extra.libdevice': <module …>}
context = <…>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<…>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = <…>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x…>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function …>, 'min_dot_size': <function … at 0x7f767cc3d1b0>}
module_map = {'triton.language.extra.libdevice': <module …>}
context = <…>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
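[Editor's note, not part of the captured log: the repeated ValueError is an architecture mismatch rather than a bug in the test body. Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, and to the best of my knowledge its CUDA lowering requires compute capability 8.9+ (Ada/Hopper), while the A10G GPU on a linux.g5 runner reports 8.6. Below is a minimal sketch of a capability guard that would skip such cases on unsupported GPUs; the helper name and its placement are illustrative, not FBGEMM's actual code.]

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) lowering needs SM 8.9 (Ada) or 9.0 (Hopper);
    # the A10G on this runner is SM 8.6, hence the CompilationError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the test above:
#
# @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None:
#     ...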
[The next six Hypothesis examples repeat the identical test source and traceback, each failing at moe/activation_test.py:117 in fn() with the same CompilationError while compiling _fbgemm_silu_mul_quant; the compiled=True runs only add a torch/_dynamo/eval_frame.py:678 frame. Duplicated listings omitted:]

Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv not supported)
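[Editor's note, not part of the captured log: for readers of this log, what the op under test computes can be reconstructed from the test's own ref_fn: y = SiLU(x0) * x1, followed by row-wise FP8 quantization. A minimal eager-mode sketch follows; the quantization details (448.0 as the float8_e4m3fn max, capping the row max by scale_ub) are assumptions about triton_quantize_fp8_row's behavior, not its actual implementation.]

from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    fp8_max: float = 448.0,  # finite max of torch.float8_e4m3fn (assumed)
) -> Tuple[torch.Tensor, torch.Tensor]:
    x0_fp32 = x0.to(torch.float32)
    y = x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)  # SiLU(x0) * x1
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap the per-row scale
    scale = (row_max / fp8_max).clamp(min=1e-12)  # one scale per row
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale


# Dequantization, as the test itself does it:
#   y = y_fp8.to(torch.float32) * y_scale[:, None]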
Trying example: test_silu_mul_quant(
    self=<…>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <…>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    (test source identical to the listing above, continuing past fn():)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

[In this example fn() completed and the failure is in the reference path instead:]

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[jit/autotuner/do_bench frames identical to the first traceback above]
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
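[Editor's note, not part of the captured log: the failure can be isolated from FBGEMM entirely. The standalone sketch below should raise the same "type fp8e4nv not supported in this architecture" CompilationError on an SM 8.6 GPU, since the error fires as soon as Triton lowers a cast to fp8e4nv; it assumes only that triton and a CUDA device are available.]

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(X, Y, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(X + offs, mask=mask)
    # The cast below is what trips the ValueError on pre-SM-8.9 GPUs.
    tl.store(Y + offs, x.to(tl.float8e4nv), mask=mask)


def main(n: int = 1024) -> None:
    x = torch.randn(n, device="cuda", dtype=torch.float32)
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)


if __name__ == "__main__":
    main()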
2025-05-07T20:32:28.6562425Z op = torch.compile(op) 2025-05-07T20:32:28.6562547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6562629Z 2025-05-07T20:32:28.6562734Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6562739Z 2025-05-07T20:32:28.6562849Z moe/activation_test.py:117: 2025-05-07T20:32:28.6563000Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6563118Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6563232Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6563646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6563763Z return fn(*args, **kwargs) 2025-05-07T20:32:28.6564323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6564441Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6564846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6565100Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6565488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6565598Z kernel = self.compile( 2025-05-07T20:32:28.6566033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6566232Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6566380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6566385Z 2025-05-07T20:32:28.6566618Z self = 2025-05-07T20:32:28.6567483Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6568048Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d9c60>} 2025-05-07T20:32:28.6568969Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6569192Z context = 2025-05-07T20:32:28.6569197Z 2025-05-07T20:32:28.6569389Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6569686Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6569815Z module_map=module_map) 2025-05-07T20:32:28.6569999Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6570112Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6570201Z E ^ 2025-05-07T20:32:28.6570598Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6570603Z 2025-05-07T20:32:28.6571075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6571079Z 2025-05-07T20:32:28.6571199Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6571535Z self=, 2025-05-07T20:32:28.6571627Z T=1, 2025-05-07T20:32:28.6571715Z D=5120, 2025-05-07T20:32:28.6571809Z scale_ub=1200.0, 2025-05-07T20:32:28.6571912Z contiguous=False, 2025-05-07T20:32:28.6572006Z compiled=False, 2025-05-07T20:32:28.6572088Z ) 2025-05-07T20:32:28.6572334Z self = 2025-05-07T20:32:28.6572524Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:28.6572529Z 2025-05-07T20:32:28.6572619Z @given( 2025-05-07T20:32:28.6572754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6572873Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6573006Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6573140Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6573269Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6573363Z ) 2025-05-07T20:32:28.6573640Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6573745Z def test_silu_mul_quant( 2025-05-07T20:32:28.6573834Z self, 2025-05-07T20:32:28.6573923Z T: int, 2025-05-07T20:32:28.6574015Z D: int, 2025-05-07T20:32:28.6574125Z scale_ub: Optional[float], 2025-05-07T20:32:28.6574226Z contiguous: bool, 2025-05-07T20:32:28.6574323Z compiled: bool, 2025-05-07T20:32:28.6574412Z ) -> None: 2025-05-07T20:32:28.6574519Z torch.manual_seed(2025) 2025-05-07T20:32:28.6574605Z 2025-05-07T20:32:28.6574794Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6574884Z 2025-05-07T20:32:28.6574993Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6575134Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6575234Z x = x_sign * x_clamp 2025-05-07T20:32:28.6575337Z x0 = x[:, :D] 2025-05-07T20:32:28.6575429Z x1 = x[:, D:] 2025-05-07T20:32:28.6575511Z 2025-05-07T20:32:28.6575609Z if contiguous: 2025-05-07T20:32:28.6575713Z x0 = x0.contiguous() 2025-05-07T20:32:28.6575816Z x1 = x1.contiguous() 2025-05-07T20:32:28.6575899Z 2025-05-07T20:32:28.6576001Z if scale_ub is not None: 2025-05-07T20:32:28.6576127Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6576281Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6576366Z ) 2025-05-07T20:32:28.6576454Z else: 2025-05-07T20:32:28.6576561Z scale_ub_tensor = None 2025-05-07T20:32:28.6576642Z 2025-05-07T20:32:28.6576891Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6576996Z op = silu_mul_quant 2025-05-07T20:32:28.6577094Z if compiled: 2025-05-07T20:32:28.6577211Z op = torch.compile(op) 2025-05-07T20:32:28.6577342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6577427Z 2025-05-07T20:32:28.6577530Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6577535Z 2025-05-07T20:32:28.6577644Z moe/activation_test.py:117: 2025-05-07T20:32:28.6577793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6577906Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6578020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6578635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6578745Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6579158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6579410Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6579792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6579985Z kernel = self.compile( 2025-05-07T20:32:28.6580416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6580618Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6580764Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6580769Z 2025-05-07T20:32:28.6580996Z self = 2025-05-07T20:32:28.6581873Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6582441Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d9750>} 2025-05-07T20:32:28.6583281Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6583498Z context = 2025-05-07T20:32:28.6583503Z 2025-05-07T20:32:28.6583689Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6583986Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6584107Z module_map=module_map) 2025-05-07T20:32:28.6584293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6584407Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6584493Z E ^ 2025-05-07T20:32:28.6584892Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6584903Z 2025-05-07T20:32:28.6585364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6585369Z 2025-05-07T20:32:28.6585487Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6585738Z self=, 2025-05-07T20:32:28.6585826Z T=16384, 2025-05-07T20:32:28.6585911Z D=5120, 2025-05-07T20:32:28.6586008Z scale_ub=1200.0, 2025-05-07T20:32:28.6586105Z contiguous=False, 2025-05-07T20:32:28.6586203Z compiled=True, 2025-05-07T20:32:28.6586285Z ) 2025-05-07T20:32:28.6586641Z self = 2025-05-07T20:32:28.6586853Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:28.6586858Z 2025-05-07T20:32:28.6586944Z @given( 2025-05-07T20:32:28.6587086Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6587207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6587338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6587472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6587612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6587699Z ) 2025-05-07T20:32:28.6587986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6588095Z def test_silu_mul_quant( 2025-05-07T20:32:28.6588183Z self, 2025-05-07T20:32:28.6588276Z T: int, 2025-05-07T20:32:28.6588364Z D: int, 2025-05-07T20:32:28.6588477Z scale_ub: Optional[float], 2025-05-07T20:32:28.6588591Z contiguous: bool, 2025-05-07T20:32:28.6588690Z compiled: bool, 2025-05-07T20:32:28.6588781Z ) -> None: 2025-05-07T20:32:28.6588896Z torch.manual_seed(2025) 2025-05-07T20:32:28.6588981Z 2025-05-07T20:32:28.6589263Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6589356Z 2025-05-07T20:32:28.6589462Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6589611Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6589713Z x = x_sign * x_clamp 2025-05-07T20:32:28.6589805Z x0 = x[:, :D] 2025-05-07T20:32:28.6589904Z x1 = x[:, D:] 2025-05-07T20:32:28.6589989Z 2025-05-07T20:32:28.6590084Z if contiguous: 2025-05-07T20:32:28.6590195Z x0 = x0.contiguous() 2025-05-07T20:32:28.6590298Z x1 = x1.contiguous() 2025-05-07T20:32:28.6590381Z 2025-05-07T20:32:28.6590490Z if scale_ub is not None: 2025-05-07T20:32:28.6590618Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6590772Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6590864Z ) 2025-05-07T20:32:28.6590952Z else: 2025-05-07T20:32:28.6591073Z scale_ub_tensor = None 2025-05-07T20:32:28.6591158Z 2025-05-07T20:32:28.6591306Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6591417Z op = silu_mul_quant 2025-05-07T20:32:28.6591515Z if compiled: 2025-05-07T20:32:28.6591628Z op = torch.compile(op) 2025-05-07T20:32:28.6591757Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6591842Z 2025-05-07T20:32:28.6591946Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6591951Z 2025-05-07T20:32:28.6592067Z moe/activation_test.py:117: 2025-05-07T20:32:28.6592216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6592339Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6592459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6592875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6592987Z return fn(*args, **kwargs) 
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = 
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d96c0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
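The architecture check is independent of FBGEMM; the same ValueError can be reproduced with a few lines of standalone Triton. A sketch under the assumption of a recent Triton and PyTorch build (where tl.float8e4nv and torch.float8_e4m3fn both exist); on a pre-SM-8.9 card the launch raises the identical CompilationError:

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    # Load a block of bfloat16 values and store them as fp8e4nv;
    # on SM < 8.9 Triton rejects the fp8e4nv type during ast_to_ttir.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# Raises triton.compiler.errors.CompilationError on an A10G (SM 8.6).
_cast_to_fp8e4nv[(1,)](x, y, 1024, BLOCK=1024)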
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body and traceback as above]
E       triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
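The error message itself names the escape hatch: on this architecture Triton still supports fp8e5, i.e. the e5m2 format, which trades a mantissa bit for range. If pre-Ada coverage matters, one hedged option is to pick the fp8 dtype by compute capability; pick_fp8_dtype below is a hypothetical helper, and nothing here claims that silu_mul_quant accepts an e5m2 output today:

import torch


def pick_fp8_dtype() -> torch.dtype:
    # Prefer e4m3 (Triton fp8e4nv) where the GPU can compile it;
    # fall back to e5m2 (Triton fp8e5), which the error message
    # lists as supported on this architecture.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2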
[the same test body and an essentially identical traceback (modulo the torch.compile frame when compiled=True) repeat for each of the following Hypothesis examples, every one ending in the same CompilationError; duplicates elided]
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
[same test body and traceback as above]
E       triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6769263Z 2025-05-07T20:32:28.6769731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6769736Z 2025-05-07T20:32:28.6769857Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6770114Z self=, 2025-05-07T20:32:28.6770205Z T=2048, 2025-05-07T20:32:28.6770295Z D=7168, 2025-05-07T20:32:28.6770404Z scale_ub=None, 2025-05-07T20:32:28.6770504Z contiguous=False, 2025-05-07T20:32:28.6770601Z compiled=True, 2025-05-07T20:32:28.6770692Z ) 2025-05-07T20:32:28.6770939Z self = 2025-05-07T20:32:28.6771232Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.6771237Z 2025-05-07T20:32:28.6771325Z @given( 2025-05-07T20:32:28.6771463Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6771586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6771717Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6771852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6771989Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6772076Z ) 2025-05-07T20:32:28.6772362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6772476Z def test_silu_mul_quant( 2025-05-07T20:32:28.6772566Z self, 2025-05-07T20:32:28.6772663Z T: int, 2025-05-07T20:32:28.6772752Z D: int, 2025-05-07T20:32:28.6772866Z scale_ub: Optional[float], 2025-05-07T20:32:28.6772981Z contiguous: bool, 2025-05-07T20:32:28.6773081Z compiled: bool, 2025-05-07T20:32:28.6773172Z ) -> None: 2025-05-07T20:32:28.6773289Z torch.manual_seed(2025) 2025-05-07T20:32:28.6773373Z 2025-05-07T20:32:28.6773566Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6773659Z 2025-05-07T20:32:28.6773766Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6773911Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6774021Z x = x_sign * x_clamp 2025-05-07T20:32:28.6774115Z x0 = x[:, :D] 2025-05-07T20:32:28.6774214Z x1 = x[:, D:] 2025-05-07T20:32:28.6774299Z 2025-05-07T20:32:28.6774395Z if contiguous: 2025-05-07T20:32:28.6774512Z x0 = x0.contiguous() 2025-05-07T20:32:28.6774615Z x1 = x1.contiguous() 2025-05-07T20:32:28.6774700Z 2025-05-07T20:32:28.6774808Z if scale_ub is not None: 2025-05-07T20:32:28.6774931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6775094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6775192Z ) 2025-05-07T20:32:28.6775281Z else: 2025-05-07T20:32:28.6775390Z scale_ub_tensor = None 2025-05-07T20:32:28.6775479Z 2025-05-07T20:32:28.6775632Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6775741Z op = silu_mul_quant 2025-05-07T20:32:28.6775839Z if compiled: 2025-05-07T20:32:28.6775954Z op = torch.compile(op) 2025-05-07T20:32:28.6776081Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6776165Z 2025-05-07T20:32:28.6776269Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6776273Z 2025-05-07T20:32:28.6776486Z moe/activation_test.py:117: 2025-05-07T20:32:28.6776633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6776749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6776877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6777296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6777410Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6777970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6778082Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6778495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6778749Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6779143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6779262Z kernel = self.compile( 2025-05-07T20:32:28.6779697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6779989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6780135Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6780140Z 2025-05-07T20:32:28.6780371Z self = 2025-05-07T20:32:28.6781252Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6781819Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767698af80>} 2025-05-07T20:32:28.6782666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6782889Z context = 2025-05-07T20:32:28.6782894Z 2025-05-07T20:32:28.6783088Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6783388Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6783513Z module_map=module_map) 2025-05-07T20:32:28.6783703Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6783818Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6783907Z E ^ 2025-05-07T20:32:28.6784318Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6784323Z 2025-05-07T20:32:28.6784789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6784800Z 2025-05-07T20:32:28.6784927Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6785183Z self=, 2025-05-07T20:32:28.6785274Z T=4096, 2025-05-07T20:32:28.6785368Z D=7168, 2025-05-07T20:32:28.6785465Z scale_ub=None, 2025-05-07T20:32:28.6785564Z contiguous=False, 2025-05-07T20:32:28.6785669Z compiled=True, 2025-05-07T20:32:28.6785753Z ) 2025-05-07T20:32:28.6786000Z self = 2025-05-07T20:32:28.6786203Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.6786208Z 2025-05-07T20:32:28.6786298Z @given( 2025-05-07T20:32:28.6786528Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6786645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6786777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6786915Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6787052Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6787138Z ) 2025-05-07T20:32:28.6787424Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6787533Z def test_silu_mul_quant( 2025-05-07T20:32:28.6787629Z self, 2025-05-07T20:32:28.6787718Z T: int, 2025-05-07T20:32:28.6787806Z D: int, 2025-05-07T20:32:28.6787926Z scale_ub: Optional[float], 2025-05-07T20:32:28.6788030Z contiguous: bool, 2025-05-07T20:32:28.6788128Z compiled: bool, 2025-05-07T20:32:28.6788224Z ) -> None: 2025-05-07T20:32:28.6788333Z torch.manual_seed(2025) 2025-05-07T20:32:28.6788420Z 2025-05-07T20:32:28.6788621Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6788705Z 2025-05-07T20:32:28.6788813Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6788964Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6789155Z x = x_sign * x_clamp 2025-05-07T20:32:28.6789250Z x0 = x[:, :D] 2025-05-07T20:32:28.6789346Z x1 = x[:, D:] 2025-05-07T20:32:28.6789429Z 2025-05-07T20:32:28.6789532Z if contiguous: 2025-05-07T20:32:28.6789639Z x0 = x0.contiguous() 2025-05-07T20:32:28.6789744Z x1 = x1.contiguous() 2025-05-07T20:32:28.6789833Z 2025-05-07T20:32:28.6789937Z if scale_ub is not None: 2025-05-07T20:32:28.6790061Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6790224Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6790311Z ) 2025-05-07T20:32:28.6790403Z else: 2025-05-07T20:32:28.6790525Z scale_ub_tensor = None 2025-05-07T20:32:28.6790608Z 2025-05-07T20:32:28.6790756Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6790866Z op = silu_mul_quant 2025-05-07T20:32:28.6790965Z if compiled: 2025-05-07T20:32:28.6791092Z op = torch.compile(op) 2025-05-07T20:32:28.6791215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6791301Z 2025-05-07T20:32:28.6791410Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6791415Z 2025-05-07T20:32:28.6791525Z moe/activation_test.py:117: 2025-05-07T20:32:28.6791674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6791795Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6791909Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6792325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6792436Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6793001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6793119Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6793575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6793830Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6794224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6794333Z kernel = self.compile( 2025-05-07T20:32:28.6794769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6800327Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6800483Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6800606Z 2025-05-07T20:32:28.6800845Z self = 2025-05-07T20:32:28.6801717Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6802287Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767698be20>} 2025-05-07T20:32:28.6803131Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6803347Z context = 2025-05-07T20:32:28.6803352Z 2025-05-07T20:32:28.6803551Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6803849Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6803971Z module_map=module_map) 2025-05-07T20:32:28.6804246Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6804361Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6804452Z E ^ 2025-05-07T20:32:28.6804857Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6804862Z 2025-05-07T20:32:28.6805330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6805334Z 2025-05-07T20:32:28.6805455Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6805706Z self=, 2025-05-07T20:32:28.6805806Z T=16384, 2025-05-07T20:32:28.6805895Z D=5120, 2025-05-07T20:32:28.6805988Z scale_ub=1200.0, 2025-05-07T20:32:28.6806091Z contiguous=False, 2025-05-07T20:32:28.6806187Z compiled=False, 2025-05-07T20:32:28.6806280Z ) 2025-05-07T20:32:28.6806529Z self = 2025-05-07T20:32:28.6806731Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:28.6806736Z 2025-05-07T20:32:28.6806828Z @given( 2025-05-07T20:32:28.6806968Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6807080Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6807213Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6807344Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6807471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6807560Z ) 2025-05-07T20:32:28.6807844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6807950Z def test_silu_mul_quant( 2025-05-07T20:32:28.6808040Z self, 2025-05-07T20:32:28.6808129Z T: int, 2025-05-07T20:32:28.6808215Z D: int, 2025-05-07T20:32:28.6808337Z scale_ub: Optional[float], 2025-05-07T20:32:28.6808437Z contiguous: bool, 2025-05-07T20:32:28.6808533Z compiled: bool, 2025-05-07T20:32:28.6808625Z ) -> None: 2025-05-07T20:32:28.6808733Z torch.manual_seed(2025) 2025-05-07T20:32:28.6808819Z 2025-05-07T20:32:28.6809008Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6809093Z 2025-05-07T20:32:28.6809203Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6809344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6809445Z x = x_sign * x_clamp 2025-05-07T20:32:28.6809541Z x0 = x[:, :D] 2025-05-07T20:32:28.6809632Z x1 = x[:, D:] 2025-05-07T20:32:28.6809713Z 2025-05-07T20:32:28.6809900Z if contiguous: 2025-05-07T20:32:28.6810006Z x0 = x0.contiguous() 2025-05-07T20:32:28.6810107Z x1 = x1.contiguous() 2025-05-07T20:32:28.6810192Z 2025-05-07T20:32:28.6810294Z if scale_ub is not None: 2025-05-07T20:32:28.6810420Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6810578Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6810663Z ) 2025-05-07T20:32:28.6810755Z else: 2025-05-07T20:32:28.6810860Z scale_ub_tensor = None 2025-05-07T20:32:28.6810942Z 2025-05-07T20:32:28.6811091Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6811194Z op = silu_mul_quant 2025-05-07T20:32:28.6811291Z if compiled: 2025-05-07T20:32:28.6811406Z op = torch.compile(op) 2025-05-07T20:32:28.6811528Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6811612Z 2025-05-07T20:32:28.6811719Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6811723Z 2025-05-07T20:32:28.6811833Z moe/activation_test.py:117: 2025-05-07T20:32:28.6811981Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6812184Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6812297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6812867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:28.6812977Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6813388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6813642Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6814025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6814137Z kernel = self.compile( 2025-05-07T20:32:28.6814575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6814775Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6814929Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6814933Z 2025-05-07T20:32:28.6815163Z self = 2025-05-07T20:32:28.6816041Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6816607Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76768717e0>} 2025-05-07T20:32:28.6817450Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6817665Z context = 2025-05-07T20:32:28.6817676Z 2025-05-07T20:32:28.6817862Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6818161Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6818284Z module_map=module_map) 2025-05-07T20:32:28.6818472Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6818584Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6818671Z E ^ 2025-05-07T20:32:28.6819073Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6819078Z 2025-05-07T20:32:28.6819627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6819633Z 2025-05-07T20:32:28.6819751Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6820013Z self=, 2025-05-07T20:32:28.6820099Z T=16384, 2025-05-07T20:32:28.6820188Z D=5120, 2025-05-07T20:32:28.6820283Z scale_ub=1200.0, 2025-05-07T20:32:28.6820379Z contiguous=True, 2025-05-07T20:32:28.6820477Z compiled=True, 2025-05-07T20:32:28.6820559Z ) 2025-05-07T20:32:28.6820805Z self = 2025-05-07T20:32:28.6821006Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:28.6821011Z 2025-05-07T20:32:28.6821098Z @given( 2025-05-07T20:32:28.6821231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6821349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6821485Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6821622Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6821750Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6821922Z ) 2025-05-07T20:32:28.6822203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6822311Z def test_silu_mul_quant( 2025-05-07T20:32:28.6822397Z self, 2025-05-07T20:32:28.6822488Z T: int, 2025-05-07T20:32:28.6822575Z D: int, 2025-05-07T20:32:28.6822685Z scale_ub: Optional[float], 2025-05-07T20:32:28.6822789Z contiguous: bool, 2025-05-07T20:32:28.6822886Z compiled: bool, 2025-05-07T20:32:28.6822972Z ) -> None: 2025-05-07T20:32:28.6823082Z torch.manual_seed(2025) 2025-05-07T20:32:28.6823164Z 2025-05-07T20:32:28.6823358Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6823442Z 2025-05-07T20:32:28.6823553Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6823701Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6824073Z x = x_sign * x_clamp 2025-05-07T20:32:28.6824215Z x0 = x[:, :D] 2025-05-07T20:32:28.6824363Z x1 = x[:, D:] 2025-05-07T20:32:28.6824446Z 2025-05-07T20:32:28.6824541Z if contiguous: 2025-05-07T20:32:28.6824648Z x0 = x0.contiguous() 2025-05-07T20:32:28.6824750Z x1 = x1.contiguous() 2025-05-07T20:32:28.6824831Z 2025-05-07T20:32:28.6824939Z if scale_ub is not None: 2025-05-07T20:32:28.6825059Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6825217Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6825303Z ) 2025-05-07T20:32:28.6825387Z else: 2025-05-07T20:32:28.6825496Z scale_ub_tensor = None 2025-05-07T20:32:28.6825576Z 2025-05-07T20:32:28.6825729Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6825835Z op = silu_mul_quant 2025-05-07T20:32:28.6825932Z if compiled: 2025-05-07T20:32:28.6826045Z op = torch.compile(op) 2025-05-07T20:32:28.6826169Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6826257Z 2025-05-07T20:32:28.6826359Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6826369Z 2025-05-07T20:32:28.6826480Z moe/activation_test.py:117: 2025-05-07T20:32:28.6826624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6826744Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6826855Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6827269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6827377Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6828092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6828206Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6828667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6828925Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6829314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6829420Z kernel = self.compile( 2025-05-07T20:32:28.6829851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6830053Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6830196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6830201Z 2025-05-07T20:32:28.6830440Z self = 2025-05-07T20:32:28.6831311Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6832027Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7676871090>} 2025-05-07T20:32:28.6832871Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6833087Z context = 2025-05-07T20:32:28.6833092Z 2025-05-07T20:32:28.6833281Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6833650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6833772Z module_map=module_map) 2025-05-07T20:32:28.6833960Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6834081Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6834176Z E ^ 2025-05-07T20:32:28.6834576Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6834582Z 2025-05-07T20:32:28.6835045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6835050Z 2025-05-07T20:32:28.6835171Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6835422Z self=, 2025-05-07T20:32:28.6835513Z T=16384, 2025-05-07T20:32:28.6835601Z D=5120, 2025-05-07T20:32:28.6835696Z scale_ub=None, 2025-05-07T20:32:28.6835804Z contiguous=False, 2025-05-07T20:32:28.6835898Z compiled=True, 2025-05-07T20:32:28.6835980Z ) 2025-05-07T20:32:28.6836228Z self = 2025-05-07T20:32:28.6836433Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.6836438Z 2025-05-07T20:32:28.6836526Z @given( 2025-05-07T20:32:28.6836663Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6836776Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6836906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6837042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6837172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6837260Z ) 2025-05-07T20:32:28.6837540Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6837645Z def test_silu_mul_quant( 2025-05-07T20:32:28.6837734Z self, 2025-05-07T20:32:28.6837914Z T: int, 2025-05-07T20:32:28.6838003Z D: int, 2025-05-07T20:32:28.6838118Z scale_ub: Optional[float], 2025-05-07T20:32:28.6838220Z contiguous: bool, 2025-05-07T20:32:28.6838319Z compiled: bool, 2025-05-07T20:32:28.6838410Z ) -> None: 2025-05-07T20:32:28.6838519Z torch.manual_seed(2025) 2025-05-07T20:32:28.6838601Z 2025-05-07T20:32:28.6838796Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6838879Z 2025-05-07T20:32:28.6838989Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6839129Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6839231Z x = x_sign * x_clamp 2025-05-07T20:32:28.6839325Z x0 = x[:, :D] 2025-05-07T20:32:28.6839416Z x1 = x[:, D:] 2025-05-07T20:32:28.6839497Z 2025-05-07T20:32:28.6839596Z if contiguous: 2025-05-07T20:32:28.6839700Z x0 = x0.contiguous() 2025-05-07T20:32:28.6839808Z x1 = x1.contiguous() 2025-05-07T20:32:28.6839895Z 2025-05-07T20:32:28.6839997Z if scale_ub is not None: 2025-05-07T20:32:28.6840116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6840360Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6840445Z ) 2025-05-07T20:32:28.6840533Z else: 2025-05-07T20:32:28.6840639Z scale_ub_tensor = None 2025-05-07T20:32:28.6840720Z 2025-05-07T20:32:28.6840869Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6840972Z op = silu_mul_quant 2025-05-07T20:32:28.6841067Z if compiled: 2025-05-07T20:32:28.6841182Z op = torch.compile(op) 2025-05-07T20:32:28.6841303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6841383Z 2025-05-07T20:32:28.6841490Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6841495Z 2025-05-07T20:32:28.6841604Z moe/activation_test.py:117: 2025-05-07T20:32:28.6841761Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6841875Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6841987Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6842412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6842517Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6843074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6843189Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6843594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6843850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6844238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6844344Z kernel = self.compile( 2025-05-07T20:32:28.6844776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6844979Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6845122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6845127Z 2025-05-07T20:32:28.6845359Z self = 2025-05-07T20:32:28.6846229Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6846887Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7676872290>} 2025-05-07T20:32:28.6847727Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6847952Z context = 2025-05-07T20:32:28.6847957Z 2025-05-07T20:32:28.6848146Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6848443Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6848569Z module_map=module_map) 2025-05-07T20:32:28.6848751Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6848863Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6848951Z E ^ 2025-05-07T20:32:28.6849357Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6849363Z 2025-05-07T20:32:28.6849832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6849837Z 2025-05-07T20:32:28.6850040Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6850293Z self=, 2025-05-07T20:32:28.6850385Z T=2048, 2025-05-07T20:32:28.6850472Z D=5120, 2025-05-07T20:32:28.6850564Z scale_ub=None, 2025-05-07T20:32:28.6850667Z contiguous=False, 2025-05-07T20:32:28.6850762Z compiled=True, 2025-05-07T20:32:28.6850847Z ) 2025-05-07T20:32:28.6851092Z self = 2025-05-07T20:32:28.6851288Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.6851293Z 2025-05-07T20:32:28.6851384Z @given( 2025-05-07T20:32:28.6851516Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6851634Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6851767Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6851899Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6852039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6852124Z ) 2025-05-07T20:32:28.6852403Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6852512Z def test_silu_mul_quant( 2025-05-07T20:32:28.6852598Z self, 2025-05-07T20:32:28.6852686Z T: int, 2025-05-07T20:32:28.6852773Z D: int, 2025-05-07T20:32:28.6852883Z scale_ub: Optional[float], 2025-05-07T20:32:28.6852982Z contiguous: bool, 2025-05-07T20:32:28.6853082Z compiled: bool, 2025-05-07T20:32:28.6853173Z ) -> None: 2025-05-07T20:32:28.6853279Z torch.manual_seed(2025) 2025-05-07T20:32:28.6853364Z 2025-05-07T20:32:28.6853558Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6853642Z 2025-05-07T20:32:28.6853751Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6853893Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6854002Z x = x_sign * x_clamp 2025-05-07T20:32:28.6854093Z x0 = x[:, :D] 2025-05-07T20:32:28.6854184Z x1 = x[:, D:] 2025-05-07T20:32:28.6854268Z 2025-05-07T20:32:28.6854363Z if contiguous: 2025-05-07T20:32:28.6854466Z x0 = x0.contiguous() 2025-05-07T20:32:28.6854568Z x1 = x1.contiguous() 2025-05-07T20:32:28.6854649Z 2025-05-07T20:32:28.6854750Z if scale_ub is not None: 2025-05-07T20:32:28.6854876Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6855029Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6855114Z ) 2025-05-07T20:32:28.6855204Z else: 2025-05-07T20:32:28.6855308Z scale_ub_tensor = None 2025-05-07T20:32:28.6855392Z 2025-05-07T20:32:28.6855627Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6855730Z op = silu_mul_quant 2025-05-07T20:32:28.6855829Z if compiled: 2025-05-07T20:32:28.6855940Z op = torch.compile(op) 2025-05-07T20:32:28.6856064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6856151Z 2025-05-07T20:32:28.6856255Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6856260Z 2025-05-07T20:32:28.6856369Z moe/activation_test.py:117: 2025-05-07T20:32:28.6856517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6856631Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6856746Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6857164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6857269Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6857842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6857953Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6858363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6858740Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6859138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6859249Z kernel = self.compile( 2025-05-07T20:32:28.6859679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6859876Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6860022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6860027Z 2025-05-07T20:32:28.6860261Z self = 2025-05-07T20:32:28.6861139Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6861706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7676872170>} 2025-05-07T20:32:28.6862544Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6862761Z context = 2025-05-07T20:32:28.6862766Z 2025-05-07T20:32:28.6862953Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6863258Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6863379Z module_map=module_map) 2025-05-07T20:32:28.6863566Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6863682Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6863767Z E ^ 2025-05-07T20:32:28.6864166Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6864175Z 2025-05-07T20:32:28.6864638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6864643Z 2025-05-07T20:32:28.6864761Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6865014Z self=, 2025-05-07T20:32:28.6865102Z T=2048, 2025-05-07T20:32:28.6865188Z D=5120, 2025-05-07T20:32:28.6865403Z scale_ub=1200.0, 2025-05-07T20:32:28.6865504Z contiguous=False, 2025-05-07T20:32:28.6865597Z compiled=True, 2025-05-07T20:32:28.6865683Z ) 2025-05-07T20:32:28.6865935Z self = 2025-05-07T20:32:28.6866140Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:28.6866145Z 2025-05-07T20:32:28.6866230Z @given( 2025-05-07T20:32:28.6866363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6866478Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6866606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6866737Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6866869Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6866952Z ) 2025-05-07T20:32:28.6867228Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6867343Z def test_silu_mul_quant( 2025-05-07T20:32:28.6867428Z self, 2025-05-07T20:32:28.6867517Z T: int, 2025-05-07T20:32:28.6867602Z D: int, 2025-05-07T20:32:28.6867711Z scale_ub: Optional[float], 2025-05-07T20:32:28.6867901Z contiguous: bool, 2025-05-07T20:32:28.6867997Z compiled: bool, 2025-05-07T20:32:28.6868084Z ) -> None: 2025-05-07T20:32:28.6868191Z torch.manual_seed(2025) 2025-05-07T20:32:28.6868272Z 2025-05-07T20:32:28.6868471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6868571Z 2025-05-07T20:32:28.6868678Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6868846Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6868974Z x = x_sign * x_clamp 2025-05-07T20:32:28.6869070Z x0 = x[:, :D] 2025-05-07T20:32:28.6869171Z x1 = x[:, D:] 2025-05-07T20:32:28.6869256Z 2025-05-07T20:32:28.6869352Z if contiguous: 2025-05-07T20:32:28.6869469Z x0 = x0.contiguous() 2025-05-07T20:32:28.6869570Z x1 = x1.contiguous() 2025-05-07T20:32:28.6869652Z 2025-05-07T20:32:28.6869760Z if scale_ub is not None: 2025-05-07T20:32:28.6869881Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6870047Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6870139Z ) 2025-05-07T20:32:28.6870227Z else: 2025-05-07T20:32:28.6870334Z scale_ub_tensor = None 2025-05-07T20:32:28.6870424Z 2025-05-07T20:32:28.6870576Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6870679Z op = silu_mul_quant 2025-05-07T20:32:28.6870781Z if compiled: 2025-05-07T20:32:28.6870896Z op = torch.compile(op) 2025-05-07T20:32:28.6871023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6871107Z 2025-05-07T20:32:28.6871212Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6871217Z 2025-05-07T20:32:28.6871335Z moe/activation_test.py:117: 2025-05-07T20:32:28.6871481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6871597Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6871726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6872147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6872258Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6872823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6872934Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6873348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6873657Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6874139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6874255Z kernel = self.compile( 2025-05-07T20:32:28.6874690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6874904Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6875049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6875054Z 2025-05-07T20:32:28.6875286Z self = 2025-05-07T20:32:28.6876173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6876747Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7676873880>} 2025-05-07T20:32:28.6877601Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6877909Z context = 2025-05-07T20:32:28.6877914Z 2025-05-07T20:32:28.6878112Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6878418Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6878539Z module_map=module_map) 2025-05-07T20:32:28.6878754Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6878871Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6878959Z E ^ 2025-05-07T20:32:28.6879368Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6879373Z 2025-05-07T20:32:28.6879840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6879852Z 2025-05-07T20:32:28.6879975Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6880227Z self=, 2025-05-07T20:32:28.6880315Z T=4096, 2025-05-07T20:32:28.6880412Z D=5120, 2025-05-07T20:32:28.6880509Z scale_ub=1200.0, 2025-05-07T20:32:28.6880606Z contiguous=True, 2025-05-07T20:32:28.6880705Z compiled=True, 2025-05-07T20:32:28.6880788Z ) 2025-05-07T20:32:28.6881032Z self = 2025-05-07T20:32:28.6881233Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:28.6881238Z 2025-05-07T20:32:28.6881327Z @given( 2025-05-07T20:32:28.6881473Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6881588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6881721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6881866Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6881996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6882084Z ) 2025-05-07T20:32:28.6882370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6882478Z def test_silu_mul_quant( 2025-05-07T20:32:28.6882565Z self, 2025-05-07T20:32:28.6882658Z T: int, 2025-05-07T20:32:28.6882745Z D: int, 2025-05-07T20:32:28.6882862Z scale_ub: Optional[float], 2025-05-07T20:32:28.6882964Z contiguous: bool, 2025-05-07T20:32:28.6883063Z compiled: bool, 2025-05-07T20:32:28.6883157Z ) -> None: 2025-05-07T20:32:28.6883266Z torch.manual_seed(2025) 2025-05-07T20:32:28.6883349Z 2025-05-07T20:32:28.6883640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6883726Z 2025-05-07T20:32:28.6883833Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6883982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6884088Z x = x_sign * x_clamp 2025-05-07T20:32:28.6884179Z x0 = x[:, :D] 2025-05-07T20:32:28.6884277Z x1 = x[:, D:] 2025-05-07T20:32:28.6884358Z 2025-05-07T20:32:28.6884453Z if contiguous: 2025-05-07T20:32:28.6884563Z x0 = x0.contiguous() 2025-05-07T20:32:28.6884664Z x1 = x1.contiguous() 2025-05-07T20:32:28.6884750Z 2025-05-07T20:32:28.6884853Z if scale_ub is not None: 2025-05-07T20:32:28.6884975Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6885133Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6885222Z ) 2025-05-07T20:32:28.6885310Z else: 2025-05-07T20:32:28.6885431Z scale_ub_tensor = None 2025-05-07T20:32:28.6885514Z 2025-05-07T20:32:28.6885660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6885768Z op = silu_mul_quant 2025-05-07T20:32:28.6885952Z if compiled: 2025-05-07T20:32:28.6886065Z op = torch.compile(op) 2025-05-07T20:32:28.6886190Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6886272Z 2025-05-07T20:32:28.6886378Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6886383Z 2025-05-07T20:32:28.6886495Z moe/activation_test.py:117: 2025-05-07T20:32:28.6886643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6886764Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6886878Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6887296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6887411Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6887969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6888088Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6888538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6888804Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6889194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6889302Z kernel = self.compile( 2025-05-07T20:32:28.6889736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6889942Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6890090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6890095Z 2025-05-07T20:32:28.6890331Z self = 2025-05-07T20:32:28.6891205Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6891780Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76767f8940>} 2025-05-07T20:32:28.6892622Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6892839Z context = 2025-05-07T20:32:28.6892844Z 2025-05-07T20:32:28.6893128Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6893430Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6893564Z module_map=module_map) 2025-05-07T20:32:28.6893747Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6893859Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6893956Z E ^ 2025-05-07T20:32:28.6894358Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6894364Z 2025-05-07T20:32:28.6894829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6894841Z 2025-05-07T20:32:28.6894959Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6895212Z self=, 2025-05-07T20:32:28.6895310Z T=128, 2025-05-07T20:32:28.6895399Z D=5120, 2025-05-07T20:32:28.6895494Z scale_ub=1200.0, 2025-05-07T20:32:28.6895598Z contiguous=False, 2025-05-07T20:32:28.6895691Z compiled=True, 2025-05-07T20:32:28.6895932Z ) 2025-05-07T20:32:28.6896185Z self = 2025-05-07T20:32:28.6896378Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:28.6896383Z 2025-05-07T20:32:28.6896469Z @given( 2025-05-07T20:32:28.6896609Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6896722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6896858Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6896992Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6897122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6897209Z ) 2025-05-07T20:32:28.6897492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6897599Z def test_silu_mul_quant( 2025-05-07T20:32:28.6897693Z self, 2025-05-07T20:32:28.6897778Z T: int, 2025-05-07T20:32:28.6897866Z D: int, 2025-05-07T20:32:28.6897991Z scale_ub: Optional[float], 2025-05-07T20:32:28.6898093Z contiguous: bool, 2025-05-07T20:32:28.6898193Z compiled: bool, 2025-05-07T20:32:28.6898284Z ) -> None: 2025-05-07T20:32:28.6898415Z torch.manual_seed(2025) 2025-05-07T20:32:28.6898510Z 2025-05-07T20:32:28.6898721Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6898806Z 2025-05-07T20:32:28.6898916Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6899058Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6899159Z x = x_sign * x_clamp 2025-05-07T20:32:28.6899256Z x0 = x[:, :D] 2025-05-07T20:32:28.6899346Z x1 = x[:, D:] 2025-05-07T20:32:28.6899430Z 2025-05-07T20:32:28.6899534Z if contiguous: 2025-05-07T20:32:28.6899638Z x0 = x0.contiguous() 2025-05-07T20:32:28.6899740Z x1 = x1.contiguous() 2025-05-07T20:32:28.6899831Z 2025-05-07T20:32:28.6899938Z if scale_ub is not None: 2025-05-07T20:32:28.6900064Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6900220Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6900308Z ) 2025-05-07T20:32:28.6900399Z else: 2025-05-07T20:32:28.6900505Z scale_ub_tensor = None 2025-05-07T20:32:28.6900590Z 2025-05-07T20:32:28.6900748Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6900851Z op = silu_mul_quant 2025-05-07T20:32:28.6900949Z if compiled: 2025-05-07T20:32:28.6901068Z op = torch.compile(op) 2025-05-07T20:32:28.6901189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6901273Z 2025-05-07T20:32:28.6901509Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6901514Z 2025-05-07T20:32:28.6901626Z moe/activation_test.py:117: 2025-05-07T20:32:28.6901779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6901900Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6902013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6902438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6902547Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6903107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:28.6903225Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:28.6903632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:28.6903897Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:28.6904283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:28.6904394Z     kernel = self.compile(
2025-05-07T20:32:28.6904920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:28.6905121Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:28.6905270Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:28.6905505Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:28.6906378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:28.6906955Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f76767f91b0>}
2025-05-07T20:32:28.6907796Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:28.6908029Z context = <...>
2025-05-07T20:32:28.6908227Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:28.6908533Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:28.6908662Z                            module_map=module_map)
2025-05-07T20:32:28.6908849Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:28.6908975Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:28.6909066Z E   ^
2025-05-07T20:32:28.6909480Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.6909967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:28.6910100Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:28.6910357Z     self=<...>,
2025-05-07T20:32:28.6910449Z     T=16384,
2025-05-07T20:32:28.6910536Z     D=7168,
2025-05-07T20:32:28.6910637Z     scale_ub=1200.0,
2025-05-07T20:32:28.6910734Z     contiguous=True,
2025-05-07T20:32:28.6910829Z     compiled=True,
2025-05-07T20:32:28.6910916Z )
2025-05-07T20:32:28.6911163Z self = <...>
2025-05-07T20:32:28.6911363Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:28.6911551Z     @given(
2025-05-07T20:32:28.6911686Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:28.6911805Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:28.6911940Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:28.6912073Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:28.6912206Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:28.6912292Z     )
2025-05-07T20:32:28.6912572Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:28.6912684Z     def test_silu_mul_quant(
2025-05-07T20:32:28.6912770Z         self,
2025-05-07T20:32:28.6912858Z         T: int,
2025-05-07T20:32:28.6912950Z         D: int,
2025-05-07T20:32:28.6913061Z         scale_ub: Optional[float],
2025-05-07T20:32:28.6913162Z         contiguous: bool,
2025-05-07T20:32:28.6913262Z         compiled: bool,
2025-05-07T20:32:28.6913352Z     ) -> None:
2025-05-07T20:32:28.6913471Z         torch.manual_seed(2025)
2025-05-07T20:32:28.6913794Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.6914083Z         x_sign = torch.sign(x)
2025-05-07T20:32:28.6914226Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:28.6914332Z         x = x_sign * x_clamp
2025-05-07T20:32:28.6914425Z         x0 = x[:, :D]
2025-05-07T20:32:28.6914516Z         x1 = x[:, D:]
2025-05-07T20:32:28.6914699Z         if contiguous:
2025-05-07T20:32:28.6914804Z             x0 = x0.contiguous()
2025-05-07T20:32:28.6914915Z             x1 = x1.contiguous()
2025-05-07T20:32:28.6915100Z         if scale_ub is not None:
2025-05-07T20:32:28.6915226Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:28.6915381Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:28.6915476Z             )
2025-05-07T20:32:28.6915570Z         else:
2025-05-07T20:32:28.6915680Z             scale_ub_tensor = None
2025-05-07T20:32:28.6915915Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:28.6916029Z             op = silu_mul_quant
2025-05-07T20:32:28.6916133Z             if compiled:
2025-05-07T20:32:28.6916247Z                 op = torch.compile(op)
2025-05-07T20:32:28.6916370Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:28.6916563Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:28.6916686Z moe/activation_test.py:117:
2025-05-07T20:32:28.6916833Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:28.6916948Z moe/activation_test.py:115: in fn
2025-05-07T20:32:28.6917066Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:28.6917488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:28.6917593Z     return fn(*args, **kwargs)
2025-05-07T20:32:28.6918159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:28.6918280Z     _fbgemm_silu_mul_quant[grid](
[... Triton compilation traceback identical to the one above ...]
2025-05-07T20:32:28.6924347Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:28.6924468Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:28.6924555Z E   ^
2025-05-07T20:32:28.6924961Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.6925432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
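Editor's note: every CompilationError in this run has the same root cause. Triton lowers the fp8e4nv (FP8 E4M3) dtype only on NVIDIA GPUs of compute capability sm_89 or newer, and the A10G on this linux.g5.4xlarge runner is sm_86, hence the ('fp8e4b15', 'fp8e5') list in the error. A minimal guard sketch that would skip rather than error on such hardware; supports_fp8e4nv is a hypothetical helper, not an existing FBGEMM API:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv needs sm_89+ (Ada/Hopper);
        # torch.cuda.get_device_capability() reports (8, 6) on an A10G.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationGuardExample(unittest.TestCase):
        @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 not supported on this GPU")
        def test_silu_mul_quant(self) -> None:
            ...  # body as shown above

The same check could equally gate the kernel itself, falling back to a non-fp8 path on older architectures.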
2025-05-07T20:32:28.6925555Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... same test body and Triton traceback; fails with CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100 ...]

2025-05-07T20:32:28.6945436Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[... same CompilationError ...]

2025-05-07T20:32:28.6959954Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError ...]

2025-05-07T20:32:28.6974757Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError ...]

2025-05-07T20:32:28.6989615Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[... same CompilationError ...]
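Editor's note: to re-run one of these failing parameter sets deterministically, instead of waiting for Hypothesis to re-draw it, the case can be pinned with Hypothesis's @example decorator, which stacks with the existing @given. A sketch using the first failing example above (the test body itself is unchanged):

    from hypothesis import Verbosity, example, given, settings
    from hypothesis import strategies as st

    @example(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # unchanged body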
2025-05-07T20:32:28.7004549Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... same test body up to the failing line ...]
2025-05-07T20:32:28.7008532Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:28.7010580Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:28.7010728Z moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:32:28.7010849Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(...)): tried to allocate 112.00 MiB with 32.44 MiB free ...]

2025-05-07T20:32:28.7017140Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
[... OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)): tried to allocate 448.00 MiB with 144.44 MiB free ...]

2025-05-07T20:32:28.7023354Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(...)): tried to allocate 56.00 MiB with 32.44 MiB free ...]

2025-05-07T20:32:28.7030155Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB with 32.44 MiB free ...]
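Editor's note: the OutOfMemoryError cases are a size-and-fragmentation problem rather than a kernel bug. Each example allocates a [T, 2 * D] bfloat16 tensor plus several same-shape temporaries; for T=16384, D=7168 that is 16384 x 14336 x 2 bytes = 448.00 MiB per tensor, matching the allocation size reported above, and cached blocks from earlier examples fragment the 22 GiB pool. Two mitigations the error text itself points at, as a sketch (the env var must be set before the first CUDA allocation; placement of the cleanup, e.g. in tearDown, is hypothetical):

    import os
    # Opt in to expandable segments to reduce fragmentation, as the
    # error message suggests; must be set before CUDA is initialized.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cached_memory() -> None:
        # Finish pending kernels, then return cached blocks to the
        # allocator so the next example starts from a cleaner pool.
        torch.cuda.synchronize()
        torch.cuda.empty_cache()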
2025-05-07T20:32:28.7036400Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[... same test body and Triton traceback; fails with CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]

2025-05-07T20:32:28.7051125Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... same CompilationError ...]

2025-05-07T20:32:28.7065641Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... same CompilationError ...]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tried thirteen further examples. Each attempt re-printed the identical test body shown above, so only the drawn parameters and the outcome are kept here:

Trying example: T=2048,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)), tried to allocate 56.00 MiB
Trying example: T=1,     D=5120, scale_ub=1200.0, contiguous=True,  compiled=False -> triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant (activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)), tried to allocate 40.00 MiB
Trying example: T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 320.00 MiB
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 80.00 MiB
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 40.00 MiB
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True  -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 112.00 MiB
Trying example: T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 40.00 MiB
Trying example: T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=True  -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 448.00 MiB
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 448.00 MiB
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 448.00 MiB

Every OutOfMemoryError carried the same allocator report: GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free; including non-PyTorch memory, this process has 22.03 GiB in use, roughly 21.7 GiB of it allocated by PyTorch; the message suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation and points to https://pytorch.org/docs/stable/notes/cuda.html#environment-variables.
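Two things stand out in that allocator report: the GPU is already pinned at ~21.7 GiB before the failing allocation, and even the largest single request here is only 448 MiB, which points at memory accumulating across hypothesis examples rather than any one tensor being too large. A minimal sketch of the two mitigations the message itself suggests; where this cleanup would be wired in (e.g. a setUp/tearDown hook) is an assumption, not FBGEMM's actual test code:

import gc
import os

# The allocator only honors this if it is set before CUDA initializes, so in
# practice it belongs in the CI job environment rather than the test module
# (assumption: the workflow can export it before pytest launches).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch


def release_gpu_memory() -> None:
    # Drop dead tensors and return cached blocks to the driver between
    # hypothesis examples, so a 448 MiB request cannot fail on a 22 GiB A10G
    # that is merely fragmented rather than genuinely full.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()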
Trying example: T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False — identical test body elided; it reaches

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(remaining frames through triton/runtime/jit.py and triton/compiler/compiler.py identical to the first failure above)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
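The ValueError is architectural, not transient: Triton lowers torch.float8_e4m3fn to fp8e4nv, which requires compute capability 8.9 or newer (Ada/Hopper), while the A10G on this linux.g5.4xlarge runner reports (8, 6) and only offers fp8e4b15/fp8e5. A hedged sketch of a capability gate for tests like this one; the helper name and skip wiring are assumptions, not FBGEMM's actual fix:

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv lowering needs SM 8.9+; an A10G reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Applied to the hypothesis test so unsupported runners skip instead of fail:
skip_unless_fp8 = unittest.skipUnless(
    cuda_supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
)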
Trying example: T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False — identical test body elided
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (same allocator report as above).

moe/activation_test.py:92: OutOfMemoryError

Trying example: T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True — identical test body elided; the compiled=True path adds a dynamo frame to the same failure:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(remaining frames through triton/runtime/jit.py and triton/compiler/compiler.py identical to the first failure above)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
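Note that compiled=True changes nothing except the extra torch/_dynamo/eval_frame.py frame: torch.compile still ends up JIT-compiling the same Triton kernel, so both paths die in make_ir. A hedged sketch of reproducing the failure outside hypothesis, using the import path shown in the traceback; the shapes are the smallest failing example from the log, and on a pre-SM 8.9 GPU both calls should raise the same CompilationError:

import torch

# Import path taken from the traceback above (requires fbgemm_gpu with the
# experimental gen_ai extras installed).
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 1, 5120  # smallest failing example in the log; trivially fits in memory
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

# Eager launch and compiled launch reach the same Triton lowering, so on an
# A10G (SM 8.6) both raise the fp8e4nv ValueError during kernel compilation.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)
y_fp8_c, y_scale_c = torch.compile(silu_mul_quant)(x0, x1, None)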
Trying example: T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False — identical test body elided
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 now has only 8.44 MiB of its 22.07 GiB free.

moe/activation_test.py:95: OutOfMemoryError

Trying example: T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True — identical test body elided
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB.

moe/activation_test.py:94: OutOfMemoryError

Trying example: T=128, D=7168, scale_ub=None, contiguous=True, compiled=True — identical test body elided
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB.

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
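That repeated DeprecationWarning comes from passing benchmarking knobs to @triton.autotune; per triton-lang/triton#4496 timing moved into Triton's do_bench, so newer code simply omits them. A hedged sketch of the newer decorator style (the kernel and config values below are placeholders, not FBGEMM's tuning; the FutureWarning block of the summary continues after it):

import triton
import triton.language as tl

@triton.autotune(
    configs=[triton.Config({"BLOCK": 128}, num_warps=4)],
    key=["N"],
    # warmup=25, rep=100, use_cuda_graph=False,  # deprecated: drop these kwargs
)
@triton.jit
def _scale_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    # Trivial placeholder body: y = 2 * x over a 1D range.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)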
experimental/gen_ai/test/moe/activation_test.py: 10 warnings
  /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844.
    torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 1 failed, 1 passed, 13 warnings in 31.65s ===================
ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error)

[TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py

[EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
cachedir: .pytest_cache
hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
plugins: hypothesis-6.131.14
TMA benchmarks will be running with experimental grid constant TMA descriptor.
collecting ...
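Before the rerun's collection output, one aside on the FutureWarning above: torch.testing.assert_allclose has a direct replacement and the tolerances carry over unchanged. A one-line migration for the call the warning points at (function wrapper added here only to make the snippet self-contained):

import torch

def check_activation(y: torch.Tensor, y_ref: torch.Tensor) -> None:
    # Same tolerances as the deprecated assert_allclose call at
    # activation_test.py:72; assert_close is the documented replacement.
    torch.testing.assert_close(y, y_ref, rtol=1.6e-2, atol=1e-3)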
2025-05-07T20:32:34.6082530Z run-last-failure: rerun previous 1 failure
2025-05-07T20:32:36.9115778Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:36.9117022Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last):
2025-05-07T20:32:36.9118513Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:36.9120099Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:36.9121654Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:36.9123163Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:36.9124875Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:36.9126369Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:36.9127914Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:36.9129636Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     generator.visit(fn.parse())
2025-05-07T20:32:36.9131003Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:36.9132390Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ret = super().visit(node)
2025-05-07T20:32:36.9133525Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
2025-05-07T20:32:36.9134654Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     return visitor(node)
2025-05-07T20:32:36.9136000Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:36.9137412Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:36.9138795Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
2025-05-07T20:32:36.9139937Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     self.visit(item)
2025-05-07T20:32:36.9141241Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:36.9142745Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:36.9143912Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.9144924Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:36.9145734Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^
2025-05-07T20:32:36.9146859Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
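The ValueError above is an architecture gate rather than a bug in the kernel source: Triton only lowers the fp8e4nv (float8_e4m3fn) type on sm_89-class GPUs and newer, and the A10G in a g5 instance reports compute capability (8, 6). A sketch of the kind of probe that could detect this up front; the helper name is illustrative, not an existing FBGEMM API:

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Illustrative check: Triton's fp8e4nv lowering first appears on
    # sm_89 (Ada) / sm_90 (Hopper); the g5 runner's A10G is sm_86,
    # which is why only 'fp8e4b15' and 'fp8e5' are offered here.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)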
2025-05-07T20:32:37.5341552Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.5342978Z     self=<...>,
2025-05-07T20:32:37.5343465Z     T=1,
2025-05-07T20:32:37.5343686Z     D=5120,
2025-05-07T20:32:37.5343904Z     scale_ub=None,
2025-05-07T20:32:37.5344156Z     contiguous=True,
2025-05-07T20:32:37.5344430Z     compiled=True,
2025-05-07T20:32:37.5344664Z )
2025-05-07T20:32:37.5345036Z self = <...>
2025-05-07T20:32:37.5345602Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:37.5346002Z @given(
2025-05-07T20:32:37.5346268Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:37.5346634Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:37.5346989Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:37.5347368Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:37.5347751Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:37.5348083Z )
2025-05-07T20:32:37.5348492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:37.5349010Z def test_silu_mul_quant(
2025-05-07T20:32:37.5349294Z     self,
2025-05-07T20:32:37.5349517Z     T: int,
2025-05-07T20:32:37.5349937Z     D: int,
2025-05-07T20:32:37.5350195Z     scale_ub: Optional[float],
2025-05-07T20:32:37.5350513Z     contiguous: bool,
2025-05-07T20:32:37.5350831Z     compiled: bool,
2025-05-07T20:32:37.5351155Z ) -> None:
2025-05-07T20:32:37.5351467Z     torch.manual_seed(2025)
2025-05-07T20:32:37.5352203Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:37.5352907Z     x_sign = torch.sign(x)
2025-05-07T20:32:37.5353245Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.5353682Z     x = x_sign * x_clamp
2025-05-07T20:32:37.5353956Z     x0 = x[:, :D]
2025-05-07T20:32:37.5354211Z     x1 = x[:, D:]
2025-05-07T20:32:37.5354671Z     if contiguous:
2025-05-07T20:32:37.5354942Z         x0 = x0.contiguous()
2025-05-07T20:32:37.5355244Z         x1 = x1.contiguous()
2025-05-07T20:32:37.5355749Z     if scale_ub is not None:
2025-05-07T20:32:37.5356070Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:37.5356464Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:37.5356817Z         )
2025-05-07T20:32:37.5357043Z     else:
2025-05-07T20:32:37.5357289Z         scale_ub_tensor = None
2025-05-07T20:32:37.5357844Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.5358208Z         op = silu_mul_quant
2025-05-07T20:32:37.5358492Z         if compiled:
2025-05-07T20:32:37.5358783Z             op = torch.compile(op)
2025-05-07T20:32:37.5359129Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.5359671Z     y_fp8, y_scale = fn()
2025-05-07T20:32:37.5360003Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:37.5360611Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.5361003Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:37.5361337Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:37.5361700Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:37.5362117Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.5362709Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:37.5363058Z moe/activation_test.py:126:
2025-05-07T20:32:37.5363402Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.5363787Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:37.5364264Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.5365176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:37.5366044Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:37.5366666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.5367451Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.5368239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:37.5369060Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:37.5369919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:37.5370778Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:37.5371615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:37.5372340Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:37.5373318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:37.5373916Z     fn()
2025-05-07T20:32:37.5374494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:37.5375177Z     self.fn.run(
2025-05-07T20:32:37.5375714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.5376324Z     kernel = self.compile(
2025-05-07T20:32:37.5376937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.5377696Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.5378156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.5378665Z self = <...>
2025-05-07T20:32:37.5379900Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.5381555Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
2025-05-07T20:32:37.5383098Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:37.5384271Z context = <...>
2025-05-07T20:32:37.5384793Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.5385401Z >   return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.5385941Z                        module_map=module_map)
2025-05-07T20:32:37.5386360Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.5386771Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:37.5387078Z E   ^
2025-05-07T20:32:37.5387610Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.5388597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
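Note that this failure is in the test's reference path: triton_quantize_fp8_row is itself a Triton kernel, so on this GPU the reference fails before the op under test is ever compared. A hypothetical pure-PyTorch row-wise quantizer (the names and the scale_ub clamping semantics are assumptions, not FBGEMM's actual implementation) would keep the reference off the Triton compiler entirely:

from typing import Optional, Tuple
import torch

FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

def torch_quantize_fp8_row(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row scale so each row's max magnitude maps to the fp8 max.
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        # Assumed semantics: cap the per-row max used for scaling.
        row_max = torch.minimum(row_max, scale_ub)
    scale = FP8_MAX / row_max
    # A plain dtype conversion does not require the sm_89 hardware
    # paths that fused fp8 Triton kernels do.
    y_fp8 = (y * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, (1.0 / scale).squeeze(-1)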
2025-05-07T20:32:37.5389402Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.5389879Z     self=<...>,
2025-05-07T20:32:37.5390342Z     T=2048,
2025-05-07T20:32:37.5390551Z     D=5120,
2025-05-07T20:32:37.5390780Z     scale_ub=1200.0,
2025-05-07T20:32:37.5391040Z     contiguous=True,
2025-05-07T20:32:37.5391294Z     compiled=False,
2025-05-07T20:32:37.5391535Z )
2025-05-07T20:32:39.5728889Z self = <...>
2025-05-07T20:32:39.5729483Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:39.5742258Z >   y_fp8, y_scale = fn()
2025-05-07T20:32:39.5742548Z moe/activation_test.py:117:
2025-05-07T20:32:39.5742867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:39.5743376Z moe/activation_test.py:115: in fn
2025-05-07T20:32:39.5743685Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:39.5744421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:39.5745160Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:39.5745737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:39.5746470Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:39.5747173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:39.5747749Z     kernel = self.compile(
2025-05-07T20:32:39.5748335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:39.5749044Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:39.5749461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:39.5749934Z self = <...>
2025-05-07T20:32:39.5751081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:39.5752549Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
2025-05-07T20:32:39.5754067Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:39.5755162Z context = <...>
2025-05-07T20:32:39.5755649Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:39.5756201Z >   return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:39.5756692Z                        module_map=module_map)
2025-05-07T20:32:39.5757079Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:39.5757455Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:39.5757724Z E   ^
2025-05-07T20:32:39.5758220Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.5759228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
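Every remaining example fails the same way, so rather than letting hypothesis iterate through all of them, a guard of roughly this shape (illustrative; the suite does not currently carry this marker) could skip the test on pre-sm_89 hardware:

import pytest
import torch

def _fp8e4nv_available() -> bool:
    # Mirrors the Triton error above: fp8e4nv needs sm_89 or newer.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@pytest.mark.skipif(not _fp8e4nv_available(), reason="fp8e4nv requires sm_89+")
def test_silu_mul_quant_guarded() -> None:
    ...  # same body as test_silu_mul_quant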
2025-05-07T20:32:39.5759894Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:39.5760334Z     self=<...>,
2025-05-07T20:32:39.5760760Z     T=2048,
2025-05-07T20:32:39.5760964Z     D=5120,
2025-05-07T20:32:39.5761166Z     scale_ub=1200.0,
2025-05-07T20:32:39.5761405Z     contiguous=True,
2025-05-07T20:32:39.5761645Z     compiled=True,
2025-05-07T20:32:39.5761867Z )
2025-05-07T20:32:39.5762201Z self = <...>
2025-05-07T20:32:39.5762727Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:39.5785478Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:39.5785809Z moe/activation_test.py:126:
2025-05-07T20:32:39.5807577Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:39.5807959Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:39.5808249Z E   ^
2025-05-07T20:32:39.5808838Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.5809757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:39.5810416Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:39.5810864Z     self=<...>,
2025-05-07T20:32:39.5811285Z     T=16384,
2025-05-07T20:32:39.5811517Z     D=7168,
2025-05-07T20:32:39.5811772Z     scale_ub=1200.0,
2025-05-07T20:32:39.5812014Z     contiguous=False,
2025-05-07T20:32:39.5812258Z     compiled=False,
2025-05-07T20:32:39.5812481Z )
2025-05-07T20:32:41.3299247Z self = <...>
2025-05-07T20:32:41.3299797Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:41.3312816Z >   y_fp8, y_scale = fn()
2025-05-07T20:32:41.3313096Z moe/activation_test.py:117:
2025-05-07T20:32:41.3327630Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:41.3327991Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:41.3328263Z E   ^
2025-05-07T20:32:41.3328754Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.3329656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.3329225Z 2025-05-07T20:32:41.3329656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.3330202Z 2025-05-07T20:32:41.3330312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3330748Z self=, 2025-05-07T20:32:41.3331172Z T=1, 2025-05-07T20:32:41.3331365Z D=7168, 2025-05-07T20:32:41.3331571Z scale_ub=None, 2025-05-07T20:32:41.3331799Z contiguous=True, 2025-05-07T20:32:41.3332038Z compiled=True, 2025-05-07T20:32:41.3332296Z ) 2025-05-07T20:32:41.3332642Z self = 2025-05-07T20:32:41.3333142Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:41.3333416Z 2025-05-07T20:32:41.3333623Z @given( 2025-05-07T20:32:41.3333869Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.3334200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.3334519Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.3334867Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.3335212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.3335514Z ) 2025-05-07T20:32:41.3335885Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.3336348Z def test_silu_mul_quant( 2025-05-07T20:32:41.3336601Z self, 2025-05-07T20:32:41.3336810Z T: int, 2025-05-07T20:32:41.3337023Z D: int, 2025-05-07T20:32:41.3337256Z scale_ub: Optional[float], 2025-05-07T20:32:41.3337543Z contiguous: bool, 2025-05-07T20:32:41.3337797Z compiled: bool, 2025-05-07T20:32:41.3338028Z ) -> None: 2025-05-07T20:32:41.3338259Z torch.manual_seed(2025) 2025-05-07T20:32:41.3338522Z 2025-05-07T20:32:41.3338802Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.3339165Z 2025-05-07T20:32:41.3339379Z x_sign = torch.sign(x) 2025-05-07T20:32:41.3339689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.3340007Z x = x_sign * x_clamp 2025-05-07T20:32:41.3340262Z x0 = x[:, :D] 2025-05-07T20:32:41.3340496Z x1 = x[:, D:] 2025-05-07T20:32:41.3340713Z 2025-05-07T20:32:41.3340919Z if contiguous: 2025-05-07T20:32:41.3341166Z x0 = x0.contiguous() 2025-05-07T20:32:41.3341435Z x1 = x1.contiguous() 2025-05-07T20:32:41.3341693Z 2025-05-07T20:32:41.3341901Z if scale_ub is not None: 2025-05-07T20:32:41.3342192Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.3342547Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.3342875Z ) 2025-05-07T20:32:41.3343077Z else: 2025-05-07T20:32:41.3343309Z scale_ub_tensor = None 2025-05-07T20:32:41.3343578Z 2025-05-07T20:32:41.3343817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3344151Z op = silu_mul_quant 2025-05-07T20:32:41.3344420Z if compiled: 2025-05-07T20:32:41.3344686Z op = torch.compile(op) 2025-05-07T20:32:41.3344998Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3345288Z 2025-05-07T20:32:41.3345495Z y_fp8, y_scale = fn() 2025-05-07T20:32:41.3345815Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:41.3346120Z 2025-05-07T20:32:41.3346373Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3346806Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:41.3347120Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:41.3347450Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:41.3347830Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:41.3348160Z 2025-05-07T20:32:41.3348378Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:41.3348583Z 2025-05-07T20:32:41.3348695Z moe/activation_test.py:126: 2025-05-07T20:32:41.3349004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3349357Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:41.3349703Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:41.3350536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:41.3351318Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:41.3351895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.3352610Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.3353410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:41.3354241Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:41.3355030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:41.3355813Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:41.3356567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:41.3357240Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:41.3357875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:41.3358417Z fn() 2025-05-07T20:32:41.3358947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:41.3359566Z self.fn.run( 2025-05-07T20:32:41.3360059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.3360611Z kernel = self.compile( 2025-05-07T20:32:41.3361183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.3361870Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.3362288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3362530Z 2025-05-07T20:32:41.3362748Z self = 2025-05-07T20:32:41.3363880Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.3365315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7eff09ea5ab0>} 2025-05-07T20:32:41.3366722Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.3367796Z context = 2025-05-07T20:32:41.3368099Z 2025-05-07T20:32:41.3368276Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.3368995Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.3369494Z module_map=module_map) 2025-05-07T20:32:41.3369875Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.3370255Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:41.3370538Z E ^ 2025-05-07T20:32:41.3371021Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.3371500Z 2025-05-07T20:32:41.3371935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.3372529Z 2025-05-07T20:32:41.3372639Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3373075Z self=, 2025-05-07T20:32:41.3373490Z T=4096, 2025-05-07T20:32:41.3373690Z D=5120, 2025-05-07T20:32:41.3373901Z scale_ub=None, 2025-05-07T20:32:41.3374127Z contiguous=False, 2025-05-07T20:32:41.3374374Z compiled=False, 2025-05-07T20:32:41.3374592Z ) 2025-05-07T20:32:41.9428501Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:41.9429820Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:41.9431232Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:41.9432775Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:41.9434277Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:41.9435733Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.9437099Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:41.9438532Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.9440013Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:41.9441305Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 
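Note on the failure mode repeated throughout this run: every example dies at Triton IR generation with ValueError("type fp8e4nv not supported in this architecture"). fp8e4nv is Triton's name for the float8 E4M3 format, which NVIDIA supports natively only on compute capability 8.9 and newer (Ada/Hopper); on older parts Triton exposes just fp8e4b15 and fp8e5, exactly the pair listed in the error, which suggests this runner's GPU is pre-Ada (for example, an A10G at SM 8.6). The sketch below is a minimal, hypothetical guard a test module could use to skip E4M3 cases on such GPUs; the helper name and skip message are illustrative and are not part of moe/activation_test.py:

import unittest

import torch

def _supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (E4M3) requires an NVIDIA GPU with
    # compute capability >= 8.9; anything older raises the CompilationError
    # seen above, offering only fp8e4b15 and fp8e5.
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

# Illustrative use on a test method:
# @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+ (Ada/Hopper)")
# def test_silu_mul_quant(self, ...) -> None: ...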
2025-05-07T20:32:41.9442579Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:41.9443840Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:41.9444923Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:41.9446110Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:41.9447379Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:41.9448722Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:41.9449887Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:41.9450974Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:41.9452202Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:41.9453616Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:41.9454802Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.9455757Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.9456536Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:41.9457601Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5693834Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:42.5695089Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:42.5696610Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:42.5698214Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:42.5699779Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:42.5701347Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5702826Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:42.5704373Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5706148Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:42.5707544Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:42.5708924Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:42.5710286Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:42.5711454Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:42.5712603Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:42.5714055Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:42.5715633Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:42.5716896Z W0507 
20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:42.5718070Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:42.5719398Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:42.5720923Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:42.5722126Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5723161Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5724185Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:42.5725338Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8314670Z self = 2025-05-07T20:32:43.8315366Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.8315809Z 2025-05-07T20:32:43.8315961Z @given( 2025-05-07T20:32:43.8316335Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8316795Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8317149Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8317546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8317932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8324767Z ) 2025-05-07T20:32:43.8325187Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8325705Z def test_silu_mul_quant( 2025-05-07T20:32:43.8325992Z self, 2025-05-07T20:32:43.8326215Z T: int, 2025-05-07T20:32:43.8326450Z D: int, 2025-05-07T20:32:43.8326902Z scale_ub: Optional[float], 2025-05-07T20:32:43.8327217Z contiguous: bool, 2025-05-07T20:32:43.8327500Z compiled: bool, 2025-05-07T20:32:43.8327766Z ) -> None: 2025-05-07T20:32:43.8328014Z torch.manual_seed(2025) 2025-05-07T20:32:43.8328305Z 2025-05-07T20:32:43.8328624Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8329011Z 2025-05-07T20:32:43.8329239Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8329576Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8329939Z x = x_sign * x_clamp 2025-05-07T20:32:43.8330212Z x0 = x[:, :D] 2025-05-07T20:32:43.8330470Z x1 = x[:, D:] 2025-05-07T20:32:43.8330711Z 2025-05-07T20:32:43.8330921Z if contiguous: 2025-05-07T20:32:43.8331189Z x0 = x0.contiguous() 2025-05-07T20:32:43.8331492Z x1 = x1.contiguous() 2025-05-07T20:32:43.8331770Z 2025-05-07T20:32:43.8331997Z if scale_ub is not None: 2025-05-07T20:32:43.8332327Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8332710Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8333073Z ) 2025-05-07T20:32:43.8333327Z else: 2025-05-07T20:32:43.8333699Z scale_ub_tensor = None 
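# (annotation, not in the logged source) scale_ub, when drawn as 1200.0, reaches
# the kernel as a one-element float32 CUDA tensor; with scale_ub=None the
# per-row quantization scale is presumably left unclamped.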
2025-05-07T20:32:43.8333993Z 2025-05-07T20:32:43.8334263Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8334617Z op = silu_mul_quant 2025-05-07T20:32:43.8334904Z if compiled: 2025-05-07T20:32:43.8335191Z op = torch.compile(op) 2025-05-07T20:32:43.8335527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8335844Z 2025-05-07T20:32:43.8336068Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8336260Z 2025-05-07T20:32:43.8336379Z moe/activation_test.py:117: 2025-05-07T20:32:43.8336715Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8337105Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8337432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8338216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8339009Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8339619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8340393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8341144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8341748Z kernel = self.compile( 2025-05-07T20:32:43.8342358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8343140Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8343643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8343908Z 2025-05-07T20:32:43.8344153Z self = 2025-05-07T20:32:43.8345389Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8346956Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff09ea6d40>} 2025-05-07T20:32:43.8348488Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8349753Z context = 2025-05-07T20:32:43.8350083Z 2025-05-07T20:32:43.8350273Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8350869Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8351411Z module_map=module_map) 2025-05-07T20:32:43.8351828Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8352227Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8352525Z E ^ 2025-05-07T20:32:43.8353055Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8353667Z 2025-05-07T20:32:43.8354146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8354728Z 2025-05-07T20:32:43.8354847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8355325Z self=, 2025-05-07T20:32:43.8355788Z T=4096, 2025-05-07T20:32:43.8356008Z D=7168, 2025-05-07T20:32:43.8356232Z scale_ub=None, 2025-05-07T20:32:43.8356483Z contiguous=False, 2025-05-07T20:32:43.8356836Z compiled=False, 2025-05-07T20:32:43.8357071Z ) 2025-05-07T20:32:43.8357436Z self = 2025-05-07T20:32:43.8357996Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.8358312Z 2025-05-07T20:32:43.8358401Z @given( 2025-05-07T20:32:43.8358666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8359025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8359370Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8359751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8360129Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8360453Z ) 2025-05-07T20:32:43.8360863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8361369Z def test_silu_mul_quant( 2025-05-07T20:32:43.8361645Z self, 2025-05-07T20:32:43.8361876Z T: int, 2025-05-07T20:32:43.8362102Z D: int, 2025-05-07T20:32:43.8362350Z scale_ub: Optional[float], 2025-05-07T20:32:43.8362670Z contiguous: bool, 2025-05-07T20:32:43.8362948Z compiled: bool, 2025-05-07T20:32:43.8363205Z ) -> None: 2025-05-07T20:32:43.8363446Z torch.manual_seed(2025) 2025-05-07T20:32:43.8363723Z 2025-05-07T20:32:43.8364035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8364420Z 2025-05-07T20:32:43.8364667Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8365002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8365353Z x = x_sign * x_clamp 2025-05-07T20:32:43.8365629Z x0 = x[:, :D] 2025-05-07T20:32:43.8365881Z x1 = x[:, D:] 2025-05-07T20:32:43.8366122Z 2025-05-07T20:32:43.8366334Z if contiguous: 2025-05-07T20:32:43.8366605Z x0 = x0.contiguous() 2025-05-07T20:32:43.8366902Z x1 = x1.contiguous() 2025-05-07T20:32:43.8367180Z 2025-05-07T20:32:43.8367403Z if scale_ub is not None: 2025-05-07T20:32:43.8367716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8368100Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8368455Z ) 2025-05-07T20:32:43.8368680Z else: 2025-05-07T20:32:43.8368917Z scale_ub_tensor = None 2025-05-07T20:32:43.8369206Z 2025-05-07T20:32:43.8369479Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8369838Z op = silu_mul_quant 2025-05-07T20:32:43.8370129Z if compiled: 2025-05-07T20:32:43.8370415Z op = torch.compile(op) 2025-05-07T20:32:43.8370747Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8371163Z 2025-05-07T20:32:43.8371396Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8371588Z 2025-05-07T20:32:43.8371710Z moe/activation_test.py:117: 2025-05-07T20:32:43.8372042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8372430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8372761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8373545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8374334Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8374945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8375724Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8376480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8377088Z kernel = self.compile( 2025-05-07T20:32:43.8377706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8378538Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8378994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8379261Z 2025-05-07T20:32:43.8379495Z self = 2025-05-07T20:32:43.8380726Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8382284Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff09ea7c70>} 2025-05-07T20:32:43.8383861Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8385037Z context = 2025-05-07T20:32:43.8385366Z 2025-05-07T20:32:43.8385564Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8386167Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8386697Z module_map=module_map) 2025-05-07T20:32:43.8387114Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8387516Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8387812Z E ^ 2025-05-07T20:32:43.8388344Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8388857Z 2025-05-07T20:32:43.8389333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8389914Z 2025-05-07T20:32:43.8390044Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8390512Z self=, 2025-05-07T20:32:43.8390967Z T=128, 2025-05-07T20:32:43.8391183Z D=7168, 2025-05-07T20:32:43.8391400Z scale_ub=None, 2025-05-07T20:32:43.8391651Z contiguous=False, 2025-05-07T20:32:43.8391913Z compiled=True, 2025-05-07T20:32:43.8392139Z ) 2025-05-07T20:32:43.9072761Z self = 2025-05-07T20:32:43.9073577Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.9073886Z 2025-05-07T20:32:43.9073974Z @given( 2025-05-07T20:32:43.9074233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9074759Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9075111Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9075480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9075852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9076173Z ) 2025-05-07T20:32:43.9076568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9077055Z def test_silu_mul_quant( 2025-05-07T20:32:43.9077326Z self, 2025-05-07T20:32:43.9077544Z T: int, 2025-05-07T20:32:43.9077763Z D: int, 2025-05-07T20:32:43.9078010Z scale_ub: Optional[float], 2025-05-07T20:32:43.9078313Z contiguous: bool, 2025-05-07T20:32:43.9078581Z compiled: bool, 2025-05-07T20:32:43.9078827Z ) -> None: 2025-05-07T20:32:43.9079069Z torch.manual_seed(2025) 2025-05-07T20:32:43.9079337Z 2025-05-07T20:32:43.9079646Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9080023Z 2025-05-07T20:32:43.9080242Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9080561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9081068Z x = x_sign * x_clamp 2025-05-07T20:32:43.9081339Z x0 = x[:, :D] 2025-05-07T20:32:43.9081576Z x1 = x[:, D:] 2025-05-07T20:32:43.9081810Z 2025-05-07T20:32:43.9082021Z if contiguous: 2025-05-07T20:32:43.9082277Z x0 = x0.contiguous() 2025-05-07T20:32:43.9082561Z x1 = x1.contiguous() 2025-05-07T20:32:43.9082827Z 2025-05-07T20:32:43.9083036Z if scale_ub is not None: 2025-05-07T20:32:43.9083341Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.9083715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.9084063Z ) 2025-05-07T20:32:43.9084276Z else: 2025-05-07T20:32:43.9084513Z scale_ub_tensor = None 2025-05-07T20:32:43.9084793Z 2025-05-07T20:32:43.9085064Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.9085417Z op = silu_mul_quant 2025-05-07T20:32:43.9085701Z if compiled: 2025-05-07T20:32:43.9085991Z op = torch.compile(op) 2025-05-07T20:32:43.9086325Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.9086635Z 2025-05-07T20:32:43.9086851Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.9087176Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.9087504Z 2025-05-07T20:32:43.9087768Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.9088147Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.9088478Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.9088822Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.9089224Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.9089578Z 2025-05-07T20:32:43.9089811Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.9090031Z 2025-05-07T20:32:43.9090142Z moe/activation_test.py:126: 2025-05-07T20:32:43.9090476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.9090858Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.9091224Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.9092098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.9092961Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.9093596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.9094348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.9095204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.9096012Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.9096845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.9097687Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.9098495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.9099208Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.9099871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.9100448Z fn() 2025-05-07T20:32:43.9101012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.9101657Z self.fn.run( 2025-05-07T20:32:43.9102178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.9102767Z kernel = self.compile( 2025-05-07T20:32:43.9103370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.9104180Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.9104623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.9104883Z 2025-05-07T20:32:43.9105115Z self = 2025-05-07T20:32:43.9106313Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.9107832Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7eff09ea7ac0>} 2025-05-07T20:32:43.9109316Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.9110459Z context = 2025-05-07T20:32:43.9110781Z 2025-05-07T20:32:43.9110978Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.9111560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.9112080Z module_map=module_map) 2025-05-07T20:32:43.9112497Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.9112897Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.9113190Z E ^ 2025-05-07T20:32:43.9113765Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.9114262Z 2025-05-07T20:32:43.9114726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.9115297Z 2025-05-07T20:32:43.9115420Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9115877Z self=, 2025-05-07T20:32:43.9116323Z T=128, 2025-05-07T20:32:43.9116539Z D=7168, 2025-05-07T20:32:43.9116756Z scale_ub=None, 2025-05-07T20:32:43.9117002Z contiguous=False, 2025-05-07T20:32:43.9117262Z compiled=False, 2025-05-07T20:32:43.9117489Z ) 2025-05-07T20:32:44.2970707Z self = 2025-05-07T20:32:44.2971363Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.2971803Z 2025-05-07T20:32:44.2972135Z @given( 2025-05-07T20:32:44.2972511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2973007Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2973462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2973966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2974368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2974698Z ) 2025-05-07T20:32:44.2975095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2975602Z def test_silu_mul_quant( 2025-05-07T20:32:44.2975881Z self, 2025-05-07T20:32:44.2976101Z T: int, 2025-05-07T20:32:44.2976362Z D: int, 2025-05-07T20:32:44.2976613Z scale_ub: Optional[float], 2025-05-07T20:32:44.2976924Z contiguous: bool, 2025-05-07T20:32:44.2977203Z compiled: bool, 2025-05-07T20:32:44.2977461Z ) -> None: 2025-05-07T20:32:44.2977717Z torch.manual_seed(2025) 2025-05-07T20:32:44.2977992Z 2025-05-07T20:32:44.2978296Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2978683Z 2025-05-07T20:32:44.2978909Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2979390Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2979742Z x = x_sign * x_clamp 2025-05-07T20:32:44.2980018Z x0 = x[:, :D] 2025-05-07T20:32:44.2980262Z x1 = x[:, D:] 2025-05-07T20:32:44.2980499Z 2025-05-07T20:32:44.2980717Z if contiguous: 2025-05-07T20:32:44.2980979Z x0 = x0.contiguous() 2025-05-07T20:32:44.2981274Z x1 = x1.contiguous() 2025-05-07T20:32:44.2981554Z 2025-05-07T20:32:44.2981773Z if scale_ub is not None: 2025-05-07T20:32:44.2982087Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2982471Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2982823Z ) 2025-05-07T20:32:44.2983047Z else: 2025-05-07T20:32:44.2983295Z scale_ub_tensor = None 2025-05-07T20:32:44.2983587Z 2025-05-07T20:32:44.2983850Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2984218Z op = silu_mul_quant 2025-05-07T20:32:44.2984509Z if compiled: 
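# (annotation, not in the logged source) With compiled=True the op is first
# wrapped by torch.compile, so the Triton launch is traced through dynamo's
# triton_kernel_wrap; that is why the same fp8e4nv ValueError also surfaces
# earlier in this log as W0507 "identify_mutated_tensors" warnings.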
2025-05-07T20:32:44.2984790Z op = torch.compile(op) 2025-05-07T20:32:44.2985130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2985449Z 2025-05-07T20:32:44.2985672Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2985868Z 2025-05-07T20:32:44.2985982Z moe/activation_test.py:117: 2025-05-07T20:32:44.2986322Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2986692Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2987021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2987809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2988589Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2989191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2989973Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2990723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2991597Z kernel = self.compile( 2025-05-07T20:32:44.2992368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2993115Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2993617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2993876Z 2025-05-07T20:32:44.2994110Z self = 2025-05-07T20:32:44.2995779Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2997325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff1a0cbb50>} 2025-05-07T20:32:44.2998833Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2999981Z context = 2025-05-07T20:32:44.3000304Z 2025-05-07T20:32:44.3000494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3001086Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3001617Z module_map=module_map) 2025-05-07T20:32:44.3002031Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3002520Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3002820Z E ^ 2025-05-07T20:32:44.3003348Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3003853Z 2025-05-07T20:32:44.3004320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3004899Z 2025-05-07T20:32:44.3005018Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3005487Z self=, 2025-05-07T20:32:44.3005944Z T=4096, 2025-05-07T20:32:44.3006158Z D=5120, 2025-05-07T20:32:44.3006387Z scale_ub=1200.0, 2025-05-07T20:32:44.3006650Z contiguous=True, 2025-05-07T20:32:44.3006903Z compiled=False, 2025-05-07T20:32:44.3007147Z ) 2025-05-07T20:32:44.3007514Z self = 2025-05-07T20:32:44.3008075Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3008394Z 2025-05-07T20:32:44.3008483Z @given( 2025-05-07T20:32:44.3008754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3009107Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3009461Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3009842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3010219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3010541Z ) 2025-05-07T20:32:44.3010945Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3011448Z def test_silu_mul_quant( 2025-05-07T20:32:44.3011723Z self, 2025-05-07T20:32:44.3011962Z T: int, 2025-05-07T20:32:44.3012196Z D: int, 2025-05-07T20:32:44.3012447Z scale_ub: Optional[float], 2025-05-07T20:32:44.3012823Z contiguous: bool, 2025-05-07T20:32:44.3013179Z compiled: bool, 2025-05-07T20:32:44.3013498Z ) -> None: 2025-05-07T20:32:44.3013811Z torch.manual_seed(2025) 2025-05-07T20:32:44.3014160Z 2025-05-07T20:32:44.3014548Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3015035Z 2025-05-07T20:32:44.3015316Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3015646Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3016000Z x = x_sign * x_clamp 2025-05-07T20:32:44.3016279Z x0 = x[:, :D] 2025-05-07T20:32:44.3016532Z x1 = x[:, D:] 2025-05-07T20:32:44.3016770Z 2025-05-07T20:32:44.3016988Z if contiguous: 2025-05-07T20:32:44.3017266Z x0 = x0.contiguous() 2025-05-07T20:32:44.3017683Z x1 = x1.contiguous() 2025-05-07T20:32:44.3017964Z 2025-05-07T20:32:44.3018189Z if scale_ub is not None: 2025-05-07T20:32:44.3018500Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3018887Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3019249Z ) 2025-05-07T20:32:44.3025990Z else: 2025-05-07T20:32:44.3026275Z scale_ub_tensor = None 2025-05-07T20:32:44.3026576Z 2025-05-07T20:32:44.3026853Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3027214Z op = silu_mul_quant 2025-05-07T20:32:44.3027511Z if compiled: 2025-05-07T20:32:44.3027798Z op = torch.compile(op) 2025-05-07T20:32:44.3028136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3028450Z 2025-05-07T20:32:44.3028679Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3028869Z 2025-05-07T20:32:44.3028985Z moe/activation_test.py:117: 2025-05-07T20:32:44.3029334Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3029719Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3030046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3031008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3031792Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3032401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3033167Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3033988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3034596Z kernel = self.compile( 2025-05-07T20:32:44.3035212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3035943Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3036393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3036670Z 2025-05-07T20:32:44.3036907Z self = 2025-05-07T20:32:44.3038119Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3039657Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff0994a0e0>} 2025-05-07T20:32:44.3041173Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3042329Z context = 2025-05-07T20:32:44.3042657Z 2025-05-07T20:32:44.3042852Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3043525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3044066Z module_map=module_map) 2025-05-07T20:32:44.3044490Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3044896Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3045188Z E ^ 2025-05-07T20:32:44.3045720Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3046232Z 2025-05-07T20:32:44.3046706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3047286Z 2025-05-07T20:32:44.3047548Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3048021Z self=, 2025-05-07T20:32:44.3048482Z T=1, 2025-05-07T20:32:44.3048701Z D=5120, 2025-05-07T20:32:44.3048919Z scale_ub=None, 2025-05-07T20:32:44.3049169Z contiguous=True, 2025-05-07T20:32:44.3049432Z compiled=True, 2025-05-07T20:32:44.3049664Z ) 2025-05-07T20:32:44.8011030Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:44.8012629Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:44.8014176Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:44.8015798Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:44.8017555Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:44.8019133Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8020626Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.8022207Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8024061Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:44.8025465Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:44.8026840Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:44.8028207Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:44.8029378Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:44.8030529Z W0507 20:32:44.797000 87500 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:44.8031907Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:44.8033355Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:44.8034850Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:44.8036033Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:44.8037366Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:44.8038896Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:44.8040091Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8041123Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8041962Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:44.8043113Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9778874Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:44.9780326Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:44.9781831Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:44.9783420Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:44.9784972Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:44.9786520Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9787977Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.9789518Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9791103Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:44.9792495Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:44.9793928Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:44.9795286Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:44.9796630Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:44.9797784Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:44.9799151Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:44.9800590Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:44.9801851Z W0507 
20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:44.9803023Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:44.9804342Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:44.9806014Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:44.9807213Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9808238Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9809081Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:44.9810225Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.4532194Z self = 2025-05-07T20:32:45.4532871Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:45.4533338Z 2025-05-07T20:32:45.4533467Z @given( 2025-05-07T20:32:45.4533833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:45.4534311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:45.4534795Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:45.4535291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:45.4535774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:45.4536119Z ) 2025-05-07T20:32:45.4536525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:45.4537025Z def test_silu_mul_quant( 2025-05-07T20:32:45.4537298Z self, 2025-05-07T20:32:45.4537531Z T: int, 2025-05-07T20:32:45.4537758Z D: int, 2025-05-07T20:32:45.4538006Z scale_ub: Optional[float], 2025-05-07T20:32:45.4538320Z contiguous: bool, 2025-05-07T20:32:45.4538595Z compiled: bool, 2025-05-07T20:32:45.4538847Z ) -> None: 2025-05-07T20:32:45.4539096Z torch.manual_seed(2025) 2025-05-07T20:32:45.4539372Z 2025-05-07T20:32:45.4539676Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:45.4540064Z 2025-05-07T20:32:45.4540285Z x_sign = torch.sign(x) 2025-05-07T20:32:45.4540617Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:45.4540963Z x = x_sign * x_clamp 2025-05-07T20:32:45.4541238Z x0 = x[:, :D] 2025-05-07T20:32:45.4541670Z x1 = x[:, D:] 2025-05-07T20:32:45.4541910Z 2025-05-07T20:32:45.4542126Z if contiguous: 2025-05-07T20:32:45.4542395Z x0 = x0.contiguous() 2025-05-07T20:32:45.4542689Z x1 = x1.contiguous() 2025-05-07T20:32:45.4542968Z 2025-05-07T20:32:45.4543196Z if scale_ub is not None: 2025-05-07T20:32:45.4543530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:45.4543938Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:45.4544293Z ) 2025-05-07T20:32:45.4544510Z else: 2025-05-07T20:32:45.4544757Z scale_ub_tensor = None 
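# (annotation, not in the logged source) ref_fn below recomputes the activation
# in fp32 as y = x0 * sigmoid(x0) * x1, i.e. silu(x0) * x1, then row-quantizes
# it with triton_quantize_fp8_row; that reference kernel trips over the same
# fp8e4nv limit as the fused _fbgemm_silu_mul_quant.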
2025-05-07T20:32:45.4545047Z 2025-05-07T20:32:45.4545311Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.4545671Z op = silu_mul_quant 2025-05-07T20:32:45.4545958Z if compiled: 2025-05-07T20:32:45.4546244Z op = torch.compile(op) 2025-05-07T20:32:45.4546585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:45.4546912Z 2025-05-07T20:32:45.4547139Z y_fp8, y_scale = fn() 2025-05-07T20:32:45.4547459Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:45.4547786Z 2025-05-07T20:32:45.4548202Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.4548572Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:45.4548907Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:45.4549267Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:45.4549665Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.4550017Z 2025-05-07T20:32:45.4550249Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:45.4550470Z 2025-05-07T20:32:45.4550591Z moe/activation_test.py:126: 2025-05-07T20:32:45.4550928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.4551312Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:45.4551693Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.4552574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:45.4553450Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:45.4554206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.4554970Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.4555735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:45.4556545Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.4557390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:45.4558236Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.4559045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:45.4559769Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:45.4560446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:45.4561024Z fn() 2025-05-07T20:32:45.4561593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:45.4562244Z self.fn.run( 2025-05-07T20:32:45.4562771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.4563381Z kernel = self.compile( 2025-05-07T20:32:45.4564051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.4564880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.4565332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.4565603Z 2025-05-07T20:32:45.4565838Z self = 2025-05-07T20:32:45.4567054Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.4568594Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff09f20ee0>} 2025-05-07T20:32:45.4570109Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.4571259Z context = 2025-05-07T20:32:45.4571591Z 2025-05-07T20:32:45.4571783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.4572533Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.4573075Z module_map=module_map) 2025-05-07T20:32:45.4573497Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.4573907Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:45.4574207Z E ^ 2025-05-07T20:32:45.4574737Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.4575254Z 2025-05-07T20:32:45.4575732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:45.4576310Z 2025-05-07T20:32:45.4576444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.4576914Z self=, 2025-05-07T20:32:45.4577375Z T=2048, 2025-05-07T20:32:45.4577609Z D=5120, 2025-05-07T20:32:45.4577831Z scale_ub=None, 2025-05-07T20:32:45.4578083Z contiguous=True, 2025-05-07T20:32:45.4578343Z compiled=True, 2025-05-07T20:32:45.4578574Z ) 2025-05-07T20:32:45.9083737Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:45.9085455Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:45.9087004Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:45.9088631Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:45.9090204Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:45.9091782Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.9093269Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:45.9095015Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.9096630Z W0507 20:32:45.904000 87500 
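For readers tracing the failure above: the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], which pins down the contract of triton_quantize_fp8_row: it returns a row-wise FP8 tensor plus one dequantization scale per row. A minimal pure-PyTorch sketch of that contract follows; it is not FBGEMM's kernel, and the zero-row epsilon and the placement of the scale_ub clamp are assumptions:

    from typing import Optional, Tuple
    import torch

    def rowwise_quantize_fp8_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude sets the scale, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        row_max = torch.clamp(row_max, min=1e-12)  # assumed guard for all-zero rows
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # e4m3 is Triton's fp8e4nv
        scale = row_max / fp8_max  # dequantization scale, one entry per row
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale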
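The repeated ValueError is the root cause of every failure in this section: Triton only exposes the fp8e4nv (e4m3) dtype on NVIDIA GPUs with compute capability 8.9 or newer, and the GPU in this job reports an older architecture, which is why only 'fp8e4b15' and 'fp8e5' are offered. A minimal sketch of a skip guard such a test could use (the helper name is illustrative, not part of the test file):

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs sm_89+ (Ada / Hopper).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Inside a unittest-style test method:
    #     if not _supports_fp8e4nv():
    #         self.skipTest("fp8e4nv requires compute capability >= 8.9")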
2025-05-07T20:32:46.5581386Z self =
2025-05-07T20:32:46.5582062Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source echo and CompilationError traceback identical to the first failure above omitted ...]
2025-05-07T20:32:46.5625957Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:46.5626431Z     self=,
2025-05-07T20:32:46.5626878Z     T=128,
2025-05-07T20:32:46.5627100Z     D=5120,
2025-05-07T20:32:46.5627327Z     scale_ub=None,
2025-05-07T20:32:46.5627573Z     contiguous=True,
2025-05-07T20:32:46.5627831Z     compiled=True,
2025-05-07T20:32:46.5628067Z )
[... two identical identify_mutated_tensors warnings ([0/6]) omitted ...]
2025-05-07T20:32:47.9945529Z self =
2025-05-07T20:32:47.9947094Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source echo and CompilationError traceback identical to the first failure above omitted ...]
2025-05-07T20:32:47.9990761Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:47.9991199Z     self=,
2025-05-07T20:32:47.9991620Z     T=4096,
2025-05-07T20:32:47.9991828Z     D=5120,
2025-05-07T20:32:47.9992029Z     scale_ub=None,
2025-05-07T20:32:47.9992259Z     contiguous=True,
2025-05-07T20:32:47.9992496Z     compiled=True,
2025-05-07T20:32:47.9992708Z )
[... two identical identify_mutated_tensors warnings ([0/7]) omitted ...]
2025-05-07T20:32:49.2122886Z self =
2025-05-07T20:32:49.2123708Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source echo and CompilationError traceback identical to the first failure above omitted ...]
2025-05-07T20:32:49.2164344Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:49.2164838Z     self=,
2025-05-07T20:32:49.2165261Z     T=16384,
2025-05-07T20:32:49.2165477Z     D=5120,
2025-05-07T20:32:49.2165688Z     scale_ub=None,
2025-05-07T20:32:49.2165914Z     contiguous=True,
2025-05-07T20:32:49.2166154Z     compiled=True,
2025-05-07T20:32:49.2166373Z )
2025-05-07T20:32:49.2558882Z W0507 20:32:49.254000 87500 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:49.2560499Z W0507 20:32:49.254000 87500 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:49.2561894Z W0507 20:32:49.254000 87500 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:49.2562922Z W0507 20:32:49.254000 87500 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:49.2564238Z W0507 20:32:49.254000 87500 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
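The recompile-limit warning above is separate from the FP8 failures: each new (T, contiguous) combination changes x0's shape or stride, torch.compile guards on those, and after 8 recompiles dynamo stops compiling this frame. Two conventional mitigations, sketched here under the assumption that sweeping shapes through one compiled function is intentional (not a recommendation for this test):

    import torch

    # (a) Mark the sweeping dimension dynamic so one graph covers all T values:
    x0 = torch.randn(16384, 5120, device="cuda", dtype=torch.bfloat16)
    torch._dynamo.mark_dynamic(x0, 0)  # dim 0 (= T) varies across examples

    # (b) Or raise the limit named in the warning above (default 8):
    torch._dynamo.config.recompile_limit = 64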
2025-05-07T20:32:49.3601865Z self = <...>
2025-05-07T20:32:49.3602626Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:49.3603048Z
2025-05-07T20:32:49.3603167Z     @given(
2025-05-07T20:32:49.3603545Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:49.3603912Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:49.3604255Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:49.3604957Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:49.3605666Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:49.3606269Z     )
2025-05-07T20:32:49.3607000Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:49.3607942Z     def test_silu_mul_quant(
2025-05-07T20:32:49.3608454Z         self,
2025-05-07T20:32:49.3608859Z         T: int,
2025-05-07T20:32:49.3609275Z         D: int,
2025-05-07T20:32:49.3609734Z         scale_ub: Optional[float],
2025-05-07T20:32:49.3610301Z         contiguous: bool,
2025-05-07T20:32:49.3610808Z         compiled: bool,
2025-05-07T20:32:49.3611283Z     ) -> None:
2025-05-07T20:32:49.3611727Z         torch.manual_seed(2025)
2025-05-07T20:32:49.3612232Z
2025-05-07T20:32:49.3612804Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:49.3613515Z
2025-05-07T20:32:49.3613921Z         x_sign = torch.sign(x)
2025-05-07T20:32:49.3614429Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:49.3614769Z         x = x_sign * x_clamp
2025-05-07T20:32:49.3615024Z         x0 = x[:, :D]
2025-05-07T20:32:49.3615260Z         x1 = x[:, D:]
2025-05-07T20:32:49.3615489Z
2025-05-07T20:32:49.3615684Z         if contiguous:
2025-05-07T20:32:49.3615939Z             x0 = x0.contiguous()
2025-05-07T20:32:49.3616220Z             x1 = x1.contiguous()
2025-05-07T20:32:49.3616476Z
2025-05-07T20:32:49.3616685Z         if scale_ub is not None:
2025-05-07T20:32:49.3616984Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:49.3617338Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:49.3617674Z             )
2025-05-07T20:32:49.3617885Z         else:
2025-05-07T20:32:49.3618108Z             scale_ub_tensor = None
2025-05-07T20:32:49.3618386Z
2025-05-07T20:32:49.3618639Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:49.3618970Z             op = silu_mul_quant
2025-05-07T20:32:49.3619243Z             if compiled:
2025-05-07T20:32:49.3619697Z                 op = torch.compile(op)
2025-05-07T20:32:49.3620018Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:49.3620316Z
2025-05-07T20:32:49.3620532Z         y_fp8, y_scale = fn()
2025-05-07T20:32:49.3620846Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:49.3621150Z
2025-05-07T20:32:49.3621411Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:49.3621772Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:49.3622081Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:49.3622415Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:49.3622796Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:49.3623120Z
2025-05-07T20:32:49.3623339Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:49.3623545Z
2025-05-07T20:32:49.3623659Z moe/activation_test.py:126:
2025-05-07T20:32:49.3624249Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:49.3624652Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:49.3625000Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:49.3625831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:49.3626757Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:49.3627334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:49.3628054Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:49.3628780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:49.3629536Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:49.3630336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:49.3631127Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:49.3631893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:49.3632574Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:49.3638905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:49.3639474Z     fn()
2025-05-07T20:32:49.3640010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:49.3640627Z     self.fn.run(
2025-05-07T20:32:49.3641125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:49.3641727Z     kernel = self.compile(
2025-05-07T20:32:49.3642561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:49.3643347Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:49.3643784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:49.3644028Z
2025-05-07T20:32:49.3644249Z self = <...>
2025-05-07T20:32:49.3645390Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:49.3646841Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7eff0949ec20>}
2025-05-07T20:32:49.3648476Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:49.3649550Z context = <...>
2025-05-07T20:32:49.3649867Z
2025-05-07T20:32:49.3650043Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:49.3650596Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:49.3651099Z                            module_map=module_map)
2025-05-07T20:32:49.3651482Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:49.3651861Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:49.3652145Z E       ^
2025-05-07T20:32:49.3652719Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:49.3653347Z
2025-05-07T20:32:49.3653794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
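The triton.runtime.autotuner frames recur in every traceback here because the first launch of an autotuned kernel benchmarks each candidate config, and each benchmark compiles the kernel, so an unsupported dtype is only rejected at this point rather than at import time. A schematic sketch of that loop, reconstructed from the frames above (argument names are illustrative, not Triton's exact signatures):

    def autotune_run_sketch(configs, kernel_call, do_bench):
        # Per autotuner.py:186/166 above: time every candidate config; timing
        # launches the kernel, and the first launch compiles it, which is where
        # the fp8e4nv ValueError is raised.
        timings = {
            config: do_bench(lambda c=config: kernel_call(c), quantiles=(0.5, 0.2, 0.8))
            for config in configs
        }
        # do_bench returns (median, 20th, 80th percentile); pick the best median.
        return min(timings, key=lambda c: timings[c][0])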
2025-05-07T20:32:49.3654345Z
2025-05-07T20:32:49.3654478Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:49.3654946Z     self=<...>,
2025-05-07T20:32:49.3655467Z     T=1,
2025-05-07T20:32:49.3655667Z     D=5120,
2025-05-07T20:32:49.3655884Z     scale_ub=1200.0,
2025-05-07T20:32:49.3656121Z     contiguous=True,
2025-05-07T20:32:49.3656364Z     compiled=True,
2025-05-07T20:32:49.3656586Z )
2025-05-07T20:32:49.5100793Z self = <...>
2025-05-07T20:32:49.5101589Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:49.5101975Z
2025-05-07T20:32:49.5102096Z     @given(
2025-05-07T20:32:49.5102393Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:49.5102732Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:49.5103067Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:49.5103425Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:49.5103775Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:49.5104082Z     )
2025-05-07T20:32:49.5104471Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:49.5104948Z     def test_silu_mul_quant(
2025-05-07T20:32:49.5105207Z         self,
2025-05-07T20:32:49.5105425Z         T: int,
2025-05-07T20:32:49.5105637Z         D: int,
2025-05-07T20:32:49.5105872Z         scale_ub: Optional[float],
2025-05-07T20:32:49.5106166Z         contiguous: bool,
2025-05-07T20:32:49.5106426Z         compiled: bool,
2025-05-07T20:32:49.5106674Z     ) -> None:
2025-05-07T20:32:49.5106909Z         torch.manual_seed(2025)
2025-05-07T20:32:49.5107167Z
2025-05-07T20:32:49.5107454Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:49.5107818Z
2025-05-07T20:32:49.5108036Z         x_sign = torch.sign(x)
2025-05-07T20:32:49.5108340Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:49.5108672Z         x = x_sign * x_clamp
2025-05-07T20:32:49.5108931Z         x0 = x[:, :D]
2025-05-07T20:32:49.5109165Z         x1 = x[:, D:]
2025-05-07T20:32:49.5109395Z
2025-05-07T20:32:49.5109598Z         if contiguous:
2025-05-07T20:32:49.5109844Z             x0 = x0.contiguous()
2025-05-07T20:32:49.5110123Z             x1 = x1.contiguous()
2025-05-07T20:32:49.5110381Z
2025-05-07T20:32:49.5110586Z         if scale_ub is not None:
2025-05-07T20:32:49.5110880Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:49.5111239Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:49.5111561Z             )
2025-05-07T20:32:49.5111772Z         else:
2025-05-07T20:32:49.5112001Z             scale_ub_tensor = None
2025-05-07T20:32:49.5112271Z
2025-05-07T20:32:49.5112515Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:49.5113026Z             op = silu_mul_quant
2025-05-07T20:32:49.5113300Z             if compiled:
2025-05-07T20:32:49.5113656Z                 op = torch.compile(op)
2025-05-07T20:32:49.5113977Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:49.5114274Z
2025-05-07T20:32:49.5114485Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:49.5114663Z
2025-05-07T20:32:49.5114770Z moe/activation_test.py:117:
2025-05-07T20:32:49.5115087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:49.5115435Z moe/activation_test.py:115: in fn
2025-05-07T20:32:49.5115737Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:49.5116330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:49.5116920Z     return fn(*args, **kwargs)
2025-05-07T20:32:49.5117623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:49.5118356Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:49.5118929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:49.5119774Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:49.5120474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:49.5121045Z     kernel = self.compile(
2025-05-07T20:32:49.5121619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:49.5122309Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:49.5122734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:49.5122973Z
2025-05-07T20:32:49.5123199Z self = <...>
2025-05-07T20:32:49.5124587Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:49.5126039Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7eff08d2ac20>}
2025-05-07T20:32:49.5127448Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:49.5128521Z context = <...>
2025-05-07T20:32:49.5128826Z
2025-05-07T20:32:49.5129010Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:49.5129561Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:49.5130063Z                            module_map=module_map)
2025-05-07T20:32:49.5130456Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:49.5130834Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:49.5131128Z E       ^
2025-05-07T20:32:49.5131624Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:49.5132099Z
2025-05-07T20:32:49.5132539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
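Every failure in this run bottoms out in the same ValueError: Triton's fp8e4nv is the e4m3 float8 format, which only has hardware support from compute capability (8, 9) upward, while older architectures expose only the fp8e5 and fp8e4b15 types listed in the message. A minimal guard sketch, assuming only the public torch.cuda API (the helper and decorator names are illustrative, not part of FBGEMM's test suite):

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton rejects fp8e4nv (e4m3) below SM 8.9, per the ValueError above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Usage: skip fp8 tests on GPUs that only expose fp8e5 / fp8e4b15.
    skip_unless_fp8 = unittest.skipUnless(
        supports_fp8_e4m3(), "fp8e4nv (e4m3) unsupported on this GPU"
    )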
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.5132099Z 2025-05-07T20:32:49.5132539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.5133078Z 2025-05-07T20:32:49.5133198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.5133633Z self=, 2025-05-07T20:32:49.5134062Z T=1, 2025-05-07T20:32:49.5134268Z D=5120, 2025-05-07T20:32:49.5134482Z scale_ub=None, 2025-05-07T20:32:49.5134718Z contiguous=False, 2025-05-07T20:32:49.5135100Z compiled=True, 2025-05-07T20:32:49.5135319Z ) 2025-05-07T20:32:49.5827630Z self = 2025-05-07T20:32:49.5828389Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.5828757Z 2025-05-07T20:32:49.5828857Z @given( 2025-05-07T20:32:49.5829115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.5829468Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.5829808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.5830171Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.5830537Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.5830858Z ) 2025-05-07T20:32:49.5831243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.5831729Z def test_silu_mul_quant( 2025-05-07T20:32:49.5831998Z self, 2025-05-07T20:32:49.5832221Z T: int, 2025-05-07T20:32:49.5832446Z D: int, 2025-05-07T20:32:49.5832695Z scale_ub: Optional[float], 2025-05-07T20:32:49.5832995Z contiguous: bool, 2025-05-07T20:32:49.5833256Z compiled: bool, 2025-05-07T20:32:49.5833796Z ) -> None: 2025-05-07T20:32:49.5834037Z torch.manual_seed(2025) 2025-05-07T20:32:49.5834306Z 2025-05-07T20:32:49.5834610Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.5835110Z 2025-05-07T20:32:49.5835333Z x_sign = torch.sign(x) 2025-05-07T20:32:49.5835654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.5835994Z x = x_sign * x_clamp 2025-05-07T20:32:49.5836255Z x0 = x[:, :D] 2025-05-07T20:32:49.5836495Z x1 = x[:, D:] 2025-05-07T20:32:49.5836727Z 2025-05-07T20:32:49.5836929Z if contiguous: 2025-05-07T20:32:49.5837191Z x0 = x0.contiguous() 2025-05-07T20:32:49.5837476Z x1 = x1.contiguous() 2025-05-07T20:32:49.5837745Z 2025-05-07T20:32:49.5837961Z if scale_ub is not None: 2025-05-07T20:32:49.5838268Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.5838634Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.5838975Z ) 2025-05-07T20:32:49.5839193Z else: 2025-05-07T20:32:49.5839427Z scale_ub_tensor = None 2025-05-07T20:32:49.5839701Z 2025-05-07T20:32:49.5839962Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.5840307Z op = silu_mul_quant 2025-05-07T20:32:49.5840578Z if compiled: 2025-05-07T20:32:49.5840853Z op = torch.compile(op) 2025-05-07T20:32:49.5841183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.5841490Z 2025-05-07T20:32:49.5841709Z y_fp8, y_scale = fn() 2025-05-07T20:32:49.5842023Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:49.5842335Z 2025-05-07T20:32:49.5842603Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.5842971Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:49.5843291Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:49.5843633Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:49.5844028Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.5844366Z 2025-05-07T20:32:49.5844585Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:49.5844805Z 2025-05-07T20:32:49.5844915Z moe/activation_test.py:126: 2025-05-07T20:32:49.5845247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.5845610Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:49.5845971Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.5846826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:49.5847799Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:49.5848392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.5849133Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.5849882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:49.5850663Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:49.5851475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:49.5852287Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:49.5853080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:49.5853780Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:49.5854423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:49.5855129Z fn() 2025-05-07T20:32:49.5855679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:49.5856309Z self.fn.run( 2025-05-07T20:32:49.5856816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.5857400Z kernel = self.compile( 2025-05-07T20:32:49.5857992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.5858697Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.5859143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.5859392Z 2025-05-07T20:32:49.5859634Z self = 2025-05-07T20:32:49.5860805Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.5862292Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7eff0837f370>} 2025-05-07T20:32:49.5863746Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.5864857Z context = 2025-05-07T20:32:49.5865170Z 2025-05-07T20:32:49.5865363Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.5865930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.5866446Z module_map=module_map) 2025-05-07T20:32:49.5866855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.5867250Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:49.5867540Z E ^ 2025-05-07T20:32:49.5868048Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.5868536Z 2025-05-07T20:32:49.5868991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.5869544Z 2025-05-07T20:32:49.5869666Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.5870113Z self=, 2025-05-07T20:32:49.5870554Z T=1, 2025-05-07T20:32:49.5870764Z D=5120, 2025-05-07T20:32:49.5871062Z scale_ub=None, 2025-05-07T20:32:49.5871305Z contiguous=True, 2025-05-07T20:32:49.5871555Z compiled=False, 2025-05-07T20:32:49.5871780Z ) 2025-05-07T20:32:49.9339467Z self = 2025-05-07T20:32:49.9340289Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:49.9340687Z 2025-05-07T20:32:49.9340811Z @given( 2025-05-07T20:32:49.9341142Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.9341475Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.9341812Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.9342172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.9342531Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.9342845Z ) 2025-05-07T20:32:49.9343228Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.9343722Z def test_silu_mul_quant( 2025-05-07T20:32:49.9343985Z self, 2025-05-07T20:32:49.9344208Z T: int, 2025-05-07T20:32:49.9344442Z D: int, 2025-05-07T20:32:49.9344718Z scale_ub: Optional[float], 2025-05-07T20:32:49.9345206Z contiguous: bool, 2025-05-07T20:32:49.9345472Z compiled: bool, 2025-05-07T20:32:49.9345712Z ) -> None: 2025-05-07T20:32:49.9345948Z torch.manual_seed(2025) 2025-05-07T20:32:49.9346211Z 2025-05-07T20:32:49.9346503Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.9346869Z 2025-05-07T20:32:49.9347082Z x_sign = torch.sign(x) 2025-05-07T20:32:49.9347388Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.9347721Z x = x_sign * x_clamp 2025-05-07T20:32:49.9347988Z x0 = x[:, :D] 2025-05-07T20:32:49.9348218Z x1 = x[:, D:] 2025-05-07T20:32:49.9348445Z 2025-05-07T20:32:49.9348653Z if contiguous: 2025-05-07T20:32:49.9348910Z x0 = x0.contiguous() 2025-05-07T20:32:49.9349192Z x1 = x1.contiguous() 2025-05-07T20:32:49.9349457Z 2025-05-07T20:32:49.9349675Z if scale_ub is not None: 2025-05-07T20:32:49.9349970Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.9350339Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.9350675Z ) 2025-05-07T20:32:49.9350880Z else: 2025-05-07T20:32:49.9351108Z scale_ub_tensor = None 2025-05-07T20:32:49.9351378Z 2025-05-07T20:32:49.9351621Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.9351961Z op = silu_mul_quant 2025-05-07T20:32:49.9352232Z if compiled: 2025-05-07T20:32:49.9352494Z 
op = torch.compile(op) 2025-05-07T20:32:49.9352817Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9353113Z 2025-05-07T20:32:49.9353316Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.9353591Z 2025-05-07T20:32:49.9353704Z moe/activation_test.py:117: 2025-05-07T20:32:49.9354021Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9354375Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.9354679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9355419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.9356153Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.9356720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.9357445Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.9358153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.9358721Z kernel = self.compile( 2025-05-07T20:32:49.9359416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.9360114Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.9360535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9360781Z 2025-05-07T20:32:49.9361006Z self = 2025-05-07T20:32:49.9362141Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.9363597Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff0837feb0>} 2025-05-07T20:32:49.9365023Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.9366110Z context = 2025-05-07T20:32:49.9366497Z 2025-05-07T20:32:49.9366674Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.9367227Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.9367724Z module_map=module_map) 2025-05-07T20:32:49.9368110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.9368479Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.9368756Z E ^ 2025-05-07T20:32:49.9369245Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.9369719Z 2025-05-07T20:32:49.9370164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.9370708Z 2025-05-07T20:32:49.9370820Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.9371260Z self=, 2025-05-07T20:32:49.9371696Z T=128, 2025-05-07T20:32:49.9371894Z D=5120, 2025-05-07T20:32:49.9372105Z scale_ub=None, 2025-05-07T20:32:49.9372339Z contiguous=False, 2025-05-07T20:32:49.9372581Z compiled=True, 2025-05-07T20:32:49.9372803Z ) 2025-05-07T20:32:49.9373145Z self = 2025-05-07T20:32:49.9373660Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.9373950Z 2025-05-07T20:32:49.9374034Z @given( 2025-05-07T20:32:49.9374286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.9374638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.9374988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.9375345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.9375694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.9375994Z ) 2025-05-07T20:32:49.9376379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.9376852Z def test_silu_mul_quant( 2025-05-07T20:32:49.9377108Z self, 2025-05-07T20:32:49.9377318Z T: int, 2025-05-07T20:32:49.9377529Z D: int, 2025-05-07T20:32:49.9377760Z scale_ub: Optional[float], 2025-05-07T20:32:49.9378047Z contiguous: bool, 2025-05-07T20:32:49.9378307Z compiled: bool, 2025-05-07T20:32:49.9378544Z ) -> None: 2025-05-07T20:32:49.9378781Z torch.manual_seed(2025) 2025-05-07T20:32:49.9379040Z 2025-05-07T20:32:49.9379330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.9379686Z 2025-05-07T20:32:49.9379894Z x_sign = torch.sign(x) 2025-05-07T20:32:49.9380292Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.9380616Z x = x_sign * x_clamp 2025-05-07T20:32:49.9380872Z x0 = x[:, :D] 2025-05-07T20:32:49.9381106Z x1 = x[:, D:] 2025-05-07T20:32:49.9381329Z 2025-05-07T20:32:49.9381531Z if contiguous: 2025-05-07T20:32:49.9381780Z x0 = x0.contiguous() 2025-05-07T20:32:49.9382053Z x1 = x1.contiguous() 2025-05-07T20:32:49.9382310Z 2025-05-07T20:32:49.9382518Z if scale_ub is not None: 2025-05-07T20:32:49.9382809Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.9383167Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.9383499Z ) 2025-05-07T20:32:49.9383703Z else: 2025-05-07T20:32:49.9383930Z scale_ub_tensor = None 2025-05-07T20:32:49.9384203Z 2025-05-07T20:32:49.9384450Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.9384790Z op = silu_mul_quant 2025-05-07T20:32:49.9385067Z if compiled: 2025-05-07T20:32:49.9385336Z op = torch.compile(op) 2025-05-07T20:32:49.9385649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9386031Z 2025-05-07T20:32:49.9386242Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.9386420Z 2025-05-07T20:32:49.9386526Z moe/activation_test.py:117: 2025-05-07T20:32:49.9386844Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9387204Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.9387503Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9388100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.9388698Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.9389398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.9390132Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.9390704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.9391428Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.9392134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.9392702Z kernel = self.compile( 2025-05-07T20:32:49.9393277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.9394027Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.9394444Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9394698Z 2025-05-07T20:32:49.9394923Z self = 2025-05-07T20:32:49.9396071Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.9397530Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fce8c0>} 2025-05-07T20:32:49.9398960Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.9400045Z context = 2025-05-07T20:32:49.9400358Z 2025-05-07T20:32:49.9400536Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.9401096Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.9401702Z module_map=module_map) 2025-05-07T20:32:49.9402095Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.9402475Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.9402758Z E ^ 2025-05-07T20:32:49.9403247Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.9403728Z 2025-05-07T20:32:49.9404168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.9404738Z 2025-05-07T20:32:49.9404876Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.9410839Z self=, 2025-05-07T20:32:49.9411299Z T=128, 2025-05-07T20:32:49.9411511Z D=7168, 2025-05-07T20:32:49.9411713Z scale_ub=1200.0, 2025-05-07T20:32:49.9411958Z contiguous=False, 2025-05-07T20:32:49.9412208Z compiled=False, 2025-05-07T20:32:49.9412433Z ) 2025-05-07T20:32:50.0677535Z self = 2025-05-07T20:32:50.0678380Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.0679017Z 2025-05-07T20:32:50.0679139Z @given( 2025-05-07T20:32:50.0679487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.0679903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.0680233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.0680588Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.0680936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.0681243Z ) 2025-05-07T20:32:50.0681618Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.0682093Z def test_silu_mul_quant( 2025-05-07T20:32:50.0682346Z self, 2025-05-07T20:32:50.0682560Z T: int, 2025-05-07T20:32:50.0682782Z D: int, 2025-05-07T20:32:50.0683015Z scale_ub: Optional[float], 2025-05-07T20:32:50.0683311Z contiguous: bool, 2025-05-07T20:32:50.0683576Z compiled: bool, 2025-05-07T20:32:50.0683826Z ) -> None: 2025-05-07T20:32:50.0684063Z torch.manual_seed(2025) 2025-05-07T20:32:50.0684326Z 2025-05-07T20:32:50.0684639Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.0685030Z 2025-05-07T20:32:50.0685242Z x_sign = torch.sign(x) 2025-05-07T20:32:50.0685548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.0685876Z x = x_sign * x_clamp 2025-05-07T20:32:50.0686136Z x0 = x[:, :D] 2025-05-07T20:32:50.0686367Z x1 = x[:, D:] 2025-05-07T20:32:50.0686595Z 2025-05-07T20:32:50.0686798Z if contiguous: 2025-05-07T20:32:50.0687042Z x0 = x0.contiguous() 2025-05-07T20:32:50.0687320Z x1 = x1.contiguous() 2025-05-07T20:32:50.0687581Z 2025-05-07T20:32:50.0687795Z if scale_ub is not None: 2025-05-07T20:32:50.0688085Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.0688447Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.0688787Z ) 2025-05-07T20:32:50.0688989Z else: 2025-05-07T20:32:50.0689219Z scale_ub_tensor = None 2025-05-07T20:32:50.0689483Z 2025-05-07T20:32:50.0689732Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.0690065Z op = silu_mul_quant 2025-05-07T20:32:50.0690328Z if compiled: 2025-05-07T20:32:50.0690597Z op = torch.compile(op) 2025-05-07T20:32:50.0690914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0691206Z 2025-05-07T20:32:50.0691421Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.0691595Z 2025-05-07T20:32:50.0691711Z moe/activation_test.py:117: 2025-05-07T20:32:50.0692155Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.0692511Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.0692818Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0693552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.0694283Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.0694855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.0695575Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.0696278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.0696838Z kernel = self.compile( 2025-05-07T20:32:50.0697418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.0698120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.0698533Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.0698780Z 2025-05-07T20:32:50.0699001Z self = 2025-05-07T20:32:50.0700222Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.0701680Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08d2aef0>} 2025-05-07T20:32:50.0703094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.0704161Z context = 2025-05-07T20:32:50.0704466Z 2025-05-07T20:32:50.0704641Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.0705215Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.0705702Z module_map=module_map) 2025-05-07T20:32:50.0706082Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.0706453Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.0706725Z E ^ 2025-05-07T20:32:50.0707208Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.0707674Z 2025-05-07T20:32:50.0708105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.0708641Z 2025-05-07T20:32:50.0708756Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.0709190Z self=, 2025-05-07T20:32:50.0709610Z T=128, 2025-05-07T20:32:50.0709805Z D=5120, 2025-05-07T20:32:50.0710017Z scale_ub=None, 2025-05-07T20:32:50.0710445Z contiguous=False, 2025-05-07T20:32:50.0710683Z compiled=False, 2025-05-07T20:32:50.0710902Z ) 2025-05-07T20:32:50.0711235Z self = 2025-05-07T20:32:50.0711744Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.0712030Z 2025-05-07T20:32:50.0712112Z @given( 2025-05-07T20:32:50.0712355Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.0712681Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.0713002Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.0713351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.0713836Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.0714136Z ) 2025-05-07T20:32:50.0714505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.0715022Z def test_silu_mul_quant( 2025-05-07T20:32:50.0715278Z self, 2025-05-07T20:32:50.0715488Z T: int, 2025-05-07T20:32:50.0715698Z D: int, 2025-05-07T20:32:50.0715926Z scale_ub: Optional[float], 2025-05-07T20:32:50.0716218Z contiguous: bool, 2025-05-07T20:32:50.0716477Z compiled: bool, 2025-05-07T20:32:50.0716709Z ) -> None: 2025-05-07T20:32:50.0716938Z torch.manual_seed(2025) 2025-05-07T20:32:50.0717193Z 2025-05-07T20:32:50.0717474Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.0720667Z 2025-05-07T20:32:50.0720877Z x_sign = torch.sign(x) 2025-05-07T20:32:50.0721213Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.0721693Z x = x_sign * x_clamp 2025-05-07T20:32:50.0721966Z x0 = x[:, :D] 2025-05-07T20:32:50.0722195Z x1 = x[:, D:] 2025-05-07T20:32:50.0722418Z 2025-05-07T20:32:50.0722615Z if contiguous: 2025-05-07T20:32:50.0722855Z x0 = x0.contiguous() 2025-05-07T20:32:50.0723202Z x1 = x1.contiguous() 2025-05-07T20:32:50.0723456Z 2025-05-07T20:32:50.0723654Z if scale_ub is not None: 2025-05-07T20:32:50.0724229Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.0724585Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.0724903Z ) 2025-05-07T20:32:50.0725111Z else: 2025-05-07T20:32:50.0725334Z scale_ub_tensor = None 2025-05-07T20:32:50.0725601Z 2025-05-07T20:32:50.0725843Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.0726203Z op = silu_mul_quant 2025-05-07T20:32:50.0726463Z if compiled: 2025-05-07T20:32:50.0726728Z op = torch.compile(op) 2025-05-07T20:32:50.0727048Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0727335Z 2025-05-07T20:32:50.0727541Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.0727715Z 2025-05-07T20:32:50.0727826Z moe/activation_test.py:117: 2025-05-07T20:32:50.0728139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.0728486Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.0728785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0729506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.0730219Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.0730781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.0731501Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.0732192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.0732750Z kernel = self.compile( 2025-05-07T20:32:50.0733317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.0734007Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.0734416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.0734655Z 2025-05-07T20:32:50.0734871Z self = 2025-05-07T20:32:50.0735995Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.0737559Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fccb80>} 2025-05-07T20:32:50.0738955Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.0740030Z context = 2025-05-07T20:32:50.0740339Z 2025-05-07T20:32:50.0740515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.0741062Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.0741551Z module_map=module_map) 2025-05-07T20:32:50.0741934Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.0742415Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.0742689Z E ^ 2025-05-07T20:32:50.0743177Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.0743654Z 2025-05-07T20:32:50.0744087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.0744709Z 2025-05-07T20:32:50.0744826Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.0745254Z self=, 2025-05-07T20:32:50.0745672Z T=128, 2025-05-07T20:32:50.0745870Z D=5120, 2025-05-07T20:32:50.0746077Z scale_ub=1200.0, 2025-05-07T20:32:50.0746308Z contiguous=True, 2025-05-07T20:32:50.0746548Z compiled=False, 2025-05-07T20:32:50.0746765Z ) 2025-05-07T20:32:50.2691157Z self = 2025-05-07T20:32:50.2692738Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.2693465Z 2025-05-07T20:32:50.2693628Z @given( 2025-05-07T20:32:50.2694122Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.2694698Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.2695019Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.2695372Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.2695719Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.2696017Z ) 2025-05-07T20:32:50.2696389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.2696854Z def test_silu_mul_quant( 2025-05-07T20:32:50.2697112Z self, 2025-05-07T20:32:50.2697313Z T: int, 2025-05-07T20:32:50.2697521Z D: int, 2025-05-07T20:32:50.2697752Z scale_ub: Optional[float], 2025-05-07T20:32:50.2698034Z contiguous: bool, 2025-05-07T20:32:50.2698291Z compiled: bool, 2025-05-07T20:32:50.2698535Z ) -> None: 2025-05-07T20:32:50.2698759Z torch.manual_seed(2025) 2025-05-07T20:32:50.2699016Z 2025-05-07T20:32:50.2699316Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.2699678Z 2025-05-07T20:32:50.2699881Z x_sign = torch.sign(x) 2025-05-07T20:32:50.2700194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.2700523Z x = x_sign * x_clamp 2025-05-07T20:32:50.2700774Z x0 = x[:, :D] 2025-05-07T20:32:50.2701009Z x1 = x[:, D:] 2025-05-07T20:32:50.2701231Z 2025-05-07T20:32:50.2701427Z if contiguous: 2025-05-07T20:32:50.2701680Z x0 = x0.contiguous() 2025-05-07T20:32:50.2701958Z x1 = x1.contiguous() 2025-05-07T20:32:50.2702211Z 2025-05-07T20:32:50.2702424Z if scale_ub is not None: 2025-05-07T20:32:50.2702716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.2703068Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.2703396Z ) 2025-05-07T20:32:50.2703606Z else: 2025-05-07T20:32:50.2704025Z scale_ub_tensor = None 2025-05-07T20:32:50.2704300Z 2025-05-07T20:32:50.2704554Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.2704929Z op = silu_mul_quant 2025-05-07T20:32:50.2705198Z if compiled: 2025-05-07T20:32:50.2705465Z op = torch.compile(op) 2025-05-07T20:32:50.2705785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2706071Z 2025-05-07T20:32:50.2706281Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.2706454Z 2025-05-07T20:32:50.2706569Z moe/activation_test.py:117: 2025-05-07T20:32:50.2706877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2707228Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.2707527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2708336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.2709069Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.2709634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.2710355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.2711109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.2711668Z kernel = self.compile( 2025-05-07T20:32:50.2712238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.2712930Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.2713340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2713681Z 2025-05-07T20:32:50.2713900Z self = 2025-05-07T20:32:50.2715029Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.2716465Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fcff40>} 2025-05-07T20:32:50.2717859Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.2718930Z context = 2025-05-07T20:32:50.2719240Z 2025-05-07T20:32:50.2719420Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.2719972Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.2720461Z module_map=module_map) 2025-05-07T20:32:50.2720848Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.2721222Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.2721497Z E ^ 2025-05-07T20:32:50.2721987Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.2722461Z 2025-05-07T20:32:50.2722891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.2723422Z 2025-05-07T20:32:50.2723539Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.2724233Z self=, 2025-05-07T20:32:50.2724662Z T=1, 2025-05-07T20:32:50.2724862Z D=7168, 2025-05-07T20:32:50.2725067Z scale_ub=1200.0, 2025-05-07T20:32:50.2725309Z contiguous=True, 2025-05-07T20:32:50.2725680Z compiled=True, 2025-05-07T20:32:50.2725898Z ) 2025-05-07T20:32:50.2726237Z self = 2025-05-07T20:32:50.2726747Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:50.2727024Z 2025-05-07T20:32:50.2727116Z @given( 2025-05-07T20:32:50.2727357Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.2727691Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.2728018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.2728362Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.2728712Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.2729015Z ) 2025-05-07T20:32:50.2729381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.2729925Z def test_silu_mul_quant( 2025-05-07T20:32:50.2730182Z self, 2025-05-07T20:32:50.2730393Z T: int, 2025-05-07T20:32:50.2730610Z D: int, 2025-05-07T20:32:50.2730844Z scale_ub: Optional[float], 2025-05-07T20:32:50.2731143Z contiguous: bool, 2025-05-07T20:32:50.2731399Z compiled: bool, 2025-05-07T20:32:50.2731703Z ) -> None: 2025-05-07T20:32:50.2731934Z torch.manual_seed(2025) 2025-05-07T20:32:50.2732184Z 2025-05-07T20:32:50.2732473Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.2732833Z 2025-05-07T20:32:50.2733035Z x_sign = torch.sign(x) 2025-05-07T20:32:50.2733344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.2733671Z x = x_sign * x_clamp 2025-05-07T20:32:50.2733923Z x0 = x[:, :D] 2025-05-07T20:32:50.2734156Z x1 = x[:, D:] 2025-05-07T20:32:50.2734383Z 2025-05-07T20:32:50.2734587Z if contiguous: 2025-05-07T20:32:50.2734836Z x0 = x0.contiguous() 2025-05-07T20:32:50.2735112Z x1 = x1.contiguous() 2025-05-07T20:32:50.2735368Z 2025-05-07T20:32:50.2735579Z if scale_ub is not None: 2025-05-07T20:32:50.2735873Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.2736232Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.2736558Z ) 2025-05-07T20:32:50.2736770Z else: 2025-05-07T20:32:50.2736999Z scale_ub_tensor = None 2025-05-07T20:32:50.2737263Z 2025-05-07T20:32:50.2737517Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.2737853Z op = silu_mul_quant 2025-05-07T20:32:50.2738115Z if compiled: 2025-05-07T20:32:50.2738387Z op = torch.compile(op) 2025-05-07T20:32:50.2738708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2738995Z 2025-05-07T20:32:50.2739210Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.2739384Z 2025-05-07T20:32:50.2739498Z moe/activation_test.py:117: 2025-05-07T20:32:50.2739818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2740165Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.2740473Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2741061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.2741644Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.2742338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.2743059Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.2743624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.2744340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.2745036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.2745681Z kernel = self.compile( 2025-05-07T20:32:50.2746248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.2746939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.2747359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2747597Z 2025-05-07T20:32:50.2747819Z self = 2025-05-07T20:32:50.2748939Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.2750370Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fcf640>} 2025-05-07T20:32:50.2751825Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.2752894Z context = 2025-05-07T20:32:50.2753237Z 2025-05-07T20:32:50.2753418Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.2754012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.2754504Z module_map=module_map) 2025-05-07T20:32:50.2754888Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.2755255Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.2755535Z E ^ 2025-05-07T20:32:50.2756024Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.2756494Z 2025-05-07T20:32:50.2756937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.2757468Z 2025-05-07T20:32:50.2757580Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.2758019Z self=, 2025-05-07T20:32:50.2758448Z T=1, 2025-05-07T20:32:50.2758642Z D=7168, 2025-05-07T20:32:50.2758852Z scale_ub=1200.0, 2025-05-07T20:32:50.2759095Z contiguous=False, 2025-05-07T20:32:50.2759333Z compiled=True, 2025-05-07T20:32:50.2759555Z ) 2025-05-07T20:32:50.4225454Z self = 2025-05-07T20:32:50.4226366Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:50.4226869Z 2025-05-07T20:32:50.4227022Z @given( 2025-05-07T20:32:50.4227427Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.4227983Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.4228552Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.4229139Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.4229734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.4230260Z ) 2025-05-07T20:32:50.4230908Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.4231635Z def test_silu_mul_quant( 2025-05-07T20:32:50.4232038Z self, 2025-05-07T20:32:50.4232359Z T: int, 2025-05-07T20:32:50.4232690Z D: int, 2025-05-07T20:32:50.4233068Z scale_ub: Optional[float], 2025-05-07T20:32:50.4233652Z contiguous: bool, 2025-05-07T20:32:50.4246651Z compiled: bool, 2025-05-07T20:32:50.4247096Z ) -> None: 2025-05-07T20:32:50.4247503Z torch.manual_seed(2025) 2025-05-07T20:32:50.4247965Z 2025-05-07T20:32:50.4248488Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.4249104Z 2025-05-07T20:32:50.4250846Z x_sign = torch.sign(x) 2025-05-07T20:32:50.4251394Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.4251970Z x = x_sign * x_clamp 2025-05-07T20:32:50.4252431Z x0 = x[:, :D] 2025-05-07T20:32:50.4252834Z x1 = x[:, D:] 2025-05-07T20:32:50.4253216Z 2025-05-07T20:32:50.4253565Z if contiguous: 2025-05-07T20:32:50.4254031Z x0 = x0.contiguous() 2025-05-07T20:32:50.4254510Z x1 = x1.contiguous() 2025-05-07T20:32:50.4254949Z 2025-05-07T20:32:50.4255302Z if scale_ub is not None: 2025-05-07T20:32:50.4255824Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.4256447Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.4257019Z ) 2025-05-07T20:32:50.4257538Z else: 2025-05-07T20:32:50.4257919Z scale_ub_tensor = None 2025-05-07T20:32:50.4258398Z 2025-05-07T20:32:50.4258829Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.4259396Z op = silu_mul_quant 2025-05-07T20:32:50.4259843Z if compiled: 2025-05-07T20:32:50.4260283Z op = torch.compile(op) 2025-05-07T20:32:50.4260933Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.4261398Z 2025-05-07T20:32:50.4261741Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.4262001Z 2025-05-07T20:32:50.4262186Z moe/activation_test.py:117: 2025-05-07T20:32:50.4262722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.4263303Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.4263825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.4264826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.4265859Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.4267077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.4268371Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.4269342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.4270638Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.4271864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.4272857Z kernel = self.compile( 2025-05-07T20:32:50.4273905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.4275048Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.4275825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.4276271Z 2025-05-07T20:32:50.4276651Z self = 2025-05-07T20:32:50.4278667Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.4281345Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7f5b5b0>} 2025-05-07T20:32:50.4283894Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.4285852Z context = 2025-05-07T20:32:50.4286412Z 2025-05-07T20:32:50.4286714Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.4287820Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.4288693Z module_map=module_map) 2025-05-07T20:32:50.4289344Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.4289993Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.4290471Z E ^ 2025-05-07T20:32:50.4291327Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.4292174Z 2025-05-07T20:32:50.4292944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.4293921Z 2025-05-07T20:32:50.4294109Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.4294865Z self=, 2025-05-07T20:32:50.4295702Z T=1, 2025-05-07T20:32:50.4296041Z D=7168, 2025-05-07T20:32:50.4296389Z scale_ub=None, 2025-05-07T20:32:50.4296773Z contiguous=False, 2025-05-07T20:32:50.4297183Z compiled=True, 2025-05-07T20:32:50.4297552Z ) 2025-05-07T20:32:50.5331286Z self = 2025-05-07T20:32:50.5332344Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:50.5332655Z 2025-05-07T20:32:50.5332756Z @given( 2025-05-07T20:32:50.5333028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.5333395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.5333753Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.5334132Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.5334517Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.5334850Z ) 2025-05-07T20:32:50.5335265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.5335770Z def test_silu_mul_quant( 2025-05-07T20:32:50.5336056Z self, 2025-05-07T20:32:50.5336297Z T: int, 2025-05-07T20:32:50.5336526Z D: int, 2025-05-07T20:32:50.5336787Z scale_ub: Optional[float], 2025-05-07T20:32:50.5337104Z contiguous: bool, 2025-05-07T20:32:50.5337386Z compiled: bool, 2025-05-07T20:32:50.5337657Z ) -> None: 2025-05-07T20:32:50.5337912Z torch.manual_seed(2025) 2025-05-07T20:32:50.5338191Z 2025-05-07T20:32:50.5338508Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.5338903Z 2025-05-07T20:32:50.5339124Z x_sign = torch.sign(x) 2025-05-07T20:32:50.5339458Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.5339816Z x = x_sign * x_clamp 2025-05-07T20:32:50.5340090Z x0 = x[:, :D] 2025-05-07T20:32:50.5340347Z x1 = x[:, D:] 2025-05-07T20:32:50.5340590Z 2025-05-07T20:32:50.5340808Z if contiguous: 2025-05-07T20:32:50.5341073Z x0 = x0.contiguous() 2025-05-07T20:32:50.5341381Z x1 = x1.contiguous() 2025-05-07T20:32:50.5341661Z 2025-05-07T20:32:50.5341881Z if scale_ub is not None: 2025-05-07T20:32:50.5342202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.5342594Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.5342947Z ) 2025-05-07T20:32:50.5343177Z else: 2025-05-07T20:32:50.5343424Z scale_ub_tensor = None 2025-05-07T20:32:50.5343708Z 2025-05-07T20:32:50.5343979Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.5344342Z op = silu_mul_quant 2025-05-07T20:32:50.5344627Z if compiled: 2025-05-07T20:32:50.5344917Z op = torch.compile(op) 2025-05-07T20:32:50.5345261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.5345576Z 2025-05-07T20:32:50.5345803Z y_fp8, y_scale = fn() 2025-05-07T20:32:50.5346135Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:50.5346474Z 2025-05-07T20:32:50.5346929Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.5347319Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:50.5347666Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:50.5348027Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:50.5348443Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:50.5348803Z 2025-05-07T20:32:50.5349036Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:50.5349266Z 2025-05-07T20:32:50.5349383Z moe/activation_test.py:126: 2025-05-07T20:32:50.5349732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.5350120Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:50.5350596Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:50.5351501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:50.5352355Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:50.5352975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.5353875Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.5354660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:50.5355484Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:50.5356333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:50.5357181Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:50.5358014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:50.5358745Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:50.5359423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:50.5360016Z fn() 2025-05-07T20:32:50.5360595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:50.5361250Z self.fn.run( 2025-05-07T20:32:50.5361784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.5362388Z kernel = self.compile( 2025-05-07T20:32:50.5363005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.5363743Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.5364197Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.5364465Z 2025-05-07T20:32:50.5364711Z self = 2025-05-07T20:32:50.5366017Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.5367623Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efdf7f59bd0>}
2025-05-07T20:32:50.5369148Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:50.5370317Z context = 
2025-05-07T20:32:50.5370647Z 
2025-05-07T20:32:50.5370939Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:50.5371541Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:50.5372076Z module_map=module_map)
2025-05-07T20:32:50.5372503Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:50.5372922Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:50.5373227Z E ^
2025-05-07T20:32:50.5373762Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.5374280Z 
2025-05-07T20:32:50.5374753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:50.5375333Z 
2025-05-07T20:32:50.5375515Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:50.9017576Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.9018573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:50.9019294Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:50.9054194Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.9055245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:50.9055959Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:51.0191912Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.0193006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:51.0193804Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:51.0230211Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.0231204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:51.1597206Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:51.1631815Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.1632858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:51.1633643Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:51.1667196Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.1668156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:51.1668841Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:51.3827698Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.3828690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
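Every Hypothesis example in this run fails at the same point: Triton refuses to lower the fp8e4nv (float8_e4m3fn) element type while compiling _fbgemm_silu_mul_quant (and _kernel_quantize_fp8_row in the reference path), reporting that only fp8e4b15 and fp8e5 are available on this GPU. Triton's fp8e4nv lowering appears to require an NVIDIA GPU with compute capability 8.9 or newer (Ada/Hopper); older targets raise exactly the ValueError seen above. On such architectures a test like this can only pass if it is skipped or routed to a non-fp8 path. A minimal capability guard is sketched below, assuming torch and unittest; supports_fp8e4nv and ActivationTests are illustrative names, not identifiers taken from this log:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) Triton kernels only compile on NVIDIA GPUs with
    # compute capability >= 8.9; older targets raise the ValueError seen above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement on the failing test class:
@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class ActivationTests(unittest.TestCase):
    ...

Gating at the class level would also keep Hypothesis from re-running every parameter combination only to hit the same compiler error, as the remaining examples below do.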
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3828215Z 2025-05-07T20:32:51.3828690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3829277Z 2025-05-07T20:32:51.3829406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3829886Z self=, 2025-05-07T20:32:51.3830350Z T=4096, 2025-05-07T20:32:51.3830579Z D=5120, 2025-05-07T20:32:51.3830806Z scale_ub=None, 2025-05-07T20:32:51.3831070Z contiguous=False, 2025-05-07T20:32:51.3831336Z compiled=True, 2025-05-07T20:32:51.3831576Z ) 2025-05-07T20:32:51.3831941Z self = 2025-05-07T20:32:51.3832510Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.3832821Z 2025-05-07T20:32:51.3832918Z @given( 2025-05-07T20:32:51.3833189Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3833624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3833991Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3834371Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3834812Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3835230Z ) 2025-05-07T20:32:51.3835745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3836375Z def test_silu_mul_quant( 2025-05-07T20:32:51.3836736Z self, 2025-05-07T20:32:51.3837019Z T: int, 2025-05-07T20:32:51.3837284Z D: int, 2025-05-07T20:32:51.3837543Z scale_ub: Optional[float], 2025-05-07T20:32:51.3837863Z contiguous: bool, 2025-05-07T20:32:51.3838142Z compiled: bool, 2025-05-07T20:32:51.3838406Z ) -> None: 2025-05-07T20:32:51.3838666Z torch.manual_seed(2025) 2025-05-07T20:32:51.3838944Z 2025-05-07T20:32:51.3839267Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3839669Z 2025-05-07T20:32:51.3839896Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3840237Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3840598Z x = x_sign * x_clamp 2025-05-07T20:32:51.3841019Z x0 = x[:, :D] 2025-05-07T20:32:51.3841283Z x1 = x[:, D:] 2025-05-07T20:32:51.3841535Z 2025-05-07T20:32:51.3841750Z if contiguous: 2025-05-07T20:32:51.3842028Z x0 = x0.contiguous() 2025-05-07T20:32:51.3842337Z x1 = x1.contiguous() 2025-05-07T20:32:51.3842617Z 2025-05-07T20:32:51.3842847Z if scale_ub is not None: 2025-05-07T20:32:51.3843167Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3843555Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3843915Z ) 2025-05-07T20:32:51.3844147Z else: 2025-05-07T20:32:51.3844393Z scale_ub_tensor = None 2025-05-07T20:32:51.3844682Z 2025-05-07T20:32:51.3844955Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3845423Z op = silu_mul_quant 2025-05-07T20:32:51.3845732Z if compiled: 2025-05-07T20:32:51.3846022Z op = torch.compile(op) 2025-05-07T20:32:51.3846374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3846687Z 2025-05-07T20:32:51.3846914Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3847105Z 2025-05-07T20:32:51.3847227Z moe/activation_test.py:117: 2025-05-07T20:32:51.3847621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3848010Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3848341Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3848988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.3849628Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.3850385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3851176Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3851795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3852575Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3853333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3853950Z kernel = self.compile( 2025-05-07T20:32:51.3854571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3855363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3855831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3856094Z 2025-05-07T20:32:51.3856341Z self = 2025-05-07T20:32:51.3857574Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3859144Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7a09240>} 2025-05-07T20:32:51.3860683Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3861854Z context = 2025-05-07T20:32:51.3862189Z 2025-05-07T20:32:51.3862382Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3862982Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3863526Z module_map=module_map) 2025-05-07T20:32:51.3864042Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3864452Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3864757Z E ^ 2025-05-07T20:32:51.3865345Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3865863Z 2025-05-07T20:32:51.3866337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3866924Z 2025-05-07T20:32:51.7479112Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7480348Z self=, 2025-05-07T20:32:51.7481479Z T=4096, 2025-05-07T20:32:51.7482009Z D=5120, 2025-05-07T20:32:51.7482487Z scale_ub=1200.0, 2025-05-07T20:32:51.7483165Z contiguous=False, 2025-05-07T20:32:51.7483618Z compiled=False, 2025-05-07T20:32:51.7484027Z ) 2025-05-07T20:32:51.7484670Z self = 2025-05-07T20:32:51.7485542Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.7485884Z 2025-05-07T20:32:51.7485977Z @given( 2025-05-07T20:32:51.7486247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7486689Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7487040Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7487424Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7487811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7488146Z ) 2025-05-07T20:32:51.7488549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7489067Z def test_silu_mul_quant( 2025-05-07T20:32:51.7489351Z self, 2025-05-07T20:32:51.7489579Z T: int, 2025-05-07T20:32:51.7489812Z D: int, 2025-05-07T20:32:51.7490080Z scale_ub: Optional[float], 2025-05-07T20:32:51.7490394Z contiguous: bool, 2025-05-07T20:32:51.7490684Z compiled: bool, 2025-05-07T20:32:51.7490948Z ) -> None: 2025-05-07T20:32:51.7491202Z torch.manual_seed(2025) 2025-05-07T20:32:51.7491487Z 2025-05-07T20:32:51.7491804Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7492198Z 2025-05-07T20:32:51.7492426Z x_sign = torch.sign(x) 2025-05-07T20:32:51.7492793Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.7493154Z x = x_sign * x_clamp 2025-05-07T20:32:51.7493435Z x0 = x[:, :D] 2025-05-07T20:32:51.7493686Z x1 = x[:, D:] 2025-05-07T20:32:51.7493932Z 2025-05-07T20:32:51.7494157Z if contiguous: 2025-05-07T20:32:51.7494425Z x0 = x0.contiguous() 2025-05-07T20:32:51.7494728Z x1 = x1.contiguous() 2025-05-07T20:32:51.7495018Z 2025-05-07T20:32:51.7495273Z if scale_ub is not None: 2025-05-07T20:32:51.7495627Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.7496020Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.7496379Z ) 2025-05-07T20:32:51.7496609Z else: 2025-05-07T20:32:51.7496858Z scale_ub_tensor = None 2025-05-07T20:32:51.7497155Z 2025-05-07T20:32:51.7497421Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7497786Z op = silu_mul_quant 2025-05-07T20:32:51.7498188Z if compiled: 2025-05-07T20:32:51.7498474Z op = torch.compile(op) 2025-05-07T20:32:51.7498818Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7499139Z 2025-05-07T20:32:51.7499360Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.7499556Z 2025-05-07T20:32:51.7499671Z moe/activation_test.py:117: 2025-05-07T20:32:51.7500024Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7500404Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.7500879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7501672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:51.7502453Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.7503067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.7503845Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.7504597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.7505205Z kernel = self.compile( 2025-05-07T20:32:51.7505820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.7506620Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.7507083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7507343Z 2025-05-07T20:32:51.7507579Z self = 2025-05-07T20:32:51.7508804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.7510410Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7a0acb0>} 2025-05-07T20:32:51.7511932Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.7513092Z context = 2025-05-07T20:32:51.7513419Z 2025-05-07T20:32:51.7513699Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.7514298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.7514840Z module_map=module_map) 2025-05-07T20:32:51.7515256Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.7515695Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.7515990Z E ^ 2025-05-07T20:32:51.7516516Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.7517034Z 2025-05-07T20:32:51.7517503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.7518088Z 2025-05-07T20:32:51.7518205Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7518677Z self=, 2025-05-07T20:32:51.7519128Z T=4096, 2025-05-07T20:32:51.7519346Z D=5120, 2025-05-07T20:32:51.7519567Z scale_ub=1200.0, 2025-05-07T20:32:51.7519823Z contiguous=False, 2025-05-07T20:32:51.7520074Z compiled=True, 2025-05-07T20:32:51.7520310Z ) 2025-05-07T20:32:51.7520669Z self = 2025-05-07T20:32:51.7521230Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.7521544Z 2025-05-07T20:32:51.7521631Z @given( 2025-05-07T20:32:51.7521890Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7522242Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7522596Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7522973Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7523348Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7523676Z ) 2025-05-07T20:32:51.7524609Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7525117Z def test_silu_mul_quant( 2025-05-07T20:32:51.7525397Z self, 2025-05-07T20:32:51.7525623Z T: int, 2025-05-07T20:32:51.7525852Z D: int, 2025-05-07T20:32:51.7526104Z scale_ub: Optional[float], 2025-05-07T20:32:51.7526415Z contiguous: bool, 2025-05-07T20:32:51.7526691Z compiled: bool, 2025-05-07T20:32:51.7526943Z ) -> None: 2025-05-07T20:32:51.7527192Z torch.manual_seed(2025) 2025-05-07T20:32:51.7527469Z 2025-05-07T20:32:51.7527776Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7528165Z 2025-05-07T20:32:51.7528390Z x_sign = torch.sign(x) 2025-05-07T20:32:51.7528716Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.7529143Z x = x_sign * x_clamp 2025-05-07T20:32:51.7529424Z x0 = x[:, :D] 2025-05-07T20:32:51.7529669Z x1 = x[:, D:] 2025-05-07T20:32:51.7529912Z 2025-05-07T20:32:51.7530135Z if contiguous: 2025-05-07T20:32:51.7530404Z x0 = x0.contiguous() 2025-05-07T20:32:51.7530697Z x1 = x1.contiguous() 2025-05-07T20:32:51.7530974Z 2025-05-07T20:32:51.7531272Z if scale_ub is not None: 2025-05-07T20:32:51.7531581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.7531964Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.7532320Z ) 2025-05-07T20:32:51.7532537Z else: 2025-05-07T20:32:51.7532781Z scale_ub_tensor = None 2025-05-07T20:32:51.7533072Z 2025-05-07T20:32:51.7533333Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7533693Z op = silu_mul_quant 2025-05-07T20:32:51.7533989Z if compiled: 2025-05-07T20:32:51.7534271Z op = torch.compile(op) 2025-05-07T20:32:51.7534614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7534933Z 2025-05-07T20:32:51.7535158Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.7535355Z 2025-05-07T20:32:51.7535469Z moe/activation_test.py:117: 2025-05-07T20:32:51.7535810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7536192Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.7536511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7537145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.7537780Z return fn(*args, **kwargs) 
[Every subsequent Hypothesis example fails with the identical traceback and CompilationError — ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") raised from triton/compiler/compiler.py:100 while compiling _fbgemm_silu_mul_quant; only the drawn parameters change:]

2025-05-07T20:32:51.8943817Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:51.8984380Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:51.9027052Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:52.1927398Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:52.1962049Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:52.3083533Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
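[The failure is architectural rather than input-dependent: Triton lowers fp8e4nv only on GPUs with compute capability 8.9 or newer (Ada/Hopper parts), while this g5 runner's A10G reports capability 8.6, so every drawn example aborts in make_ir before the kernel launches. One way to keep such suites green on pre-8.9 runners is to gate the tests on device capability; a hedged sketch follows — the helper name, class name, and skip message are illustrative, not the actual test harness.]

# Sketch: skip FP8 tests on GPUs that cannot compile fp8e4nv (illustrative guard).
import unittest

import torch


def _device_supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv (torch.float8_e4m3fn) only on compute capability
    # 8.9+ (Ada/Hopper); the A10G on a g5 instance reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(
    _device_supports_fp8e4nv(),
    "FP8 e4m3 (fp8e4nv) requires a GPU with compute capability >= 8.9",
)
class Fp8ActivationTests(unittest.TestCase):
    ...

[Alternatively, the kernel could fall back on pre-8.9 devices to one of the fp8 dtypes the error message says this architecture does support, fp8e5 or fp8e4b15.]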
2025-05-07T20:32:52.3104365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3105211Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3105888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3106675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3107431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3108053Z kernel = self.compile( 2025-05-07T20:32:52.3108683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3109447Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3109899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3110173Z 2025-05-07T20:32:52.3110415Z self = 2025-05-07T20:32:52.3111655Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3113220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71a81f0>} 2025-05-07T20:32:52.3114797Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3122876Z context = 2025-05-07T20:32:52.3123375Z 2025-05-07T20:32:52.3123647Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3124706Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3125451Z module_map=module_map) 2025-05-07T20:32:52.3126008Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3126554Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3126863Z E ^ 2025-05-07T20:32:52.3127401Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3127925Z 2025-05-07T20:32:52.3128414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3129007Z 2025-05-07T20:32:52.6911703Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.6912430Z self=, 2025-05-07T20:32:52.6913076Z T=16384, 2025-05-07T20:32:52.6913392Z D=5120, 2025-05-07T20:32:52.6913774Z scale_ub=1200.0, 2025-05-07T20:32:52.6914033Z contiguous=False, 2025-05-07T20:32:52.6914295Z compiled=False, 2025-05-07T20:32:52.6914533Z ) 2025-05-07T20:32:52.6914900Z self = 2025-05-07T20:32:52.6915464Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.6915819Z 2025-05-07T20:32:52.6915917Z @given( 2025-05-07T20:32:52.6916187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.6916548Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.6916998Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.6917370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.6917752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.6918086Z ) 2025-05-07T20:32:52.6918481Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.6918984Z def test_silu_mul_quant( 2025-05-07T20:32:52.6919341Z self, 2025-05-07T20:32:52.6919565Z T: int, 2025-05-07T20:32:52.6919794Z D: int, 2025-05-07T20:32:52.6920046Z scale_ub: Optional[float], 2025-05-07T20:32:52.6920353Z contiguous: bool, 2025-05-07T20:32:52.6920629Z compiled: bool, 2025-05-07T20:32:52.6920889Z ) -> None: 2025-05-07T20:32:52.6921134Z torch.manual_seed(2025) 2025-05-07T20:32:52.6921413Z 2025-05-07T20:32:52.6921728Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.6922118Z 2025-05-07T20:32:52.6922343Z x_sign = torch.sign(x) 2025-05-07T20:32:52.6922678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.6923029Z x = x_sign * x_clamp 2025-05-07T20:32:52.6923306Z x0 = x[:, :D] 2025-05-07T20:32:52.6923557Z x1 = x[:, D:] 2025-05-07T20:32:52.6924084Z 2025-05-07T20:32:52.6924301Z if contiguous: 2025-05-07T20:32:52.6924571Z x0 = x0.contiguous() 2025-05-07T20:32:52.6924877Z x1 = x1.contiguous() 2025-05-07T20:32:52.6925153Z 2025-05-07T20:32:52.6925380Z if scale_ub is not None: 2025-05-07T20:32:52.6925696Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.6926080Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.6926437Z ) 2025-05-07T20:32:52.6926663Z else: 2025-05-07T20:32:52.6926901Z scale_ub_tensor = None 2025-05-07T20:32:52.6927191Z 2025-05-07T20:32:52.6927460Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.6927821Z op = silu_mul_quant 2025-05-07T20:32:52.6928116Z if compiled: 2025-05-07T20:32:52.6928405Z op = torch.compile(op) 2025-05-07T20:32:52.6928740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6929054Z 2025-05-07T20:32:52.6929278Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.6929466Z 2025-05-07T20:32:52.6929592Z moe/activation_test.py:117: 2025-05-07T20:32:52.6929928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6930310Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.6930636Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6931415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.6932190Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.6932800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.6933579Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.6934459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.6935068Z kernel = self.compile( 2025-05-07T20:32:52.6935677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.6936414Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.6936868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6937131Z 2025-05-07T20:32:52.6937368Z self = 2025-05-07T20:32:52.6938582Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.6940183Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71a8700>} 2025-05-07T20:32:52.6941690Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.6942903Z context = 2025-05-07T20:32:52.6943227Z 2025-05-07T20:32:52.6943423Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.6944010Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.6944537Z module_map=module_map) 2025-05-07T20:32:52.6944961Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.6945368Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.6945664Z E ^ 2025-05-07T20:32:52.6946201Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.6946712Z 2025-05-07T20:32:52.6947188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.6947766Z 2025-05-07T20:32:52.6947896Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.6948363Z self=, 2025-05-07T20:32:52.6948818Z T=16384, 2025-05-07T20:32:52.6949048Z D=5120, 2025-05-07T20:32:52.6949267Z scale_ub=1200.0, 2025-05-07T20:32:52.6949524Z contiguous=True, 2025-05-07T20:32:52.6949779Z compiled=True, 2025-05-07T20:32:52.6950009Z ) 2025-05-07T20:32:52.6950371Z self = 2025-05-07T20:32:52.6950934Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:52.6951244Z 2025-05-07T20:32:52.6951338Z @given( 2025-05-07T20:32:52.6951604Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.6951968Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.6952319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.6952699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.6953079Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.6953413Z ) 2025-05-07T20:32:52.6953897Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.6954400Z def test_silu_mul_quant( 2025-05-07T20:32:52.6954679Z self, 2025-05-07T20:32:52.6954904Z T: int, 2025-05-07T20:32:52.6955134Z D: int, 2025-05-07T20:32:52.6955389Z scale_ub: Optional[float], 2025-05-07T20:32:52.6955702Z contiguous: bool, 2025-05-07T20:32:52.6955981Z compiled: bool, 2025-05-07T20:32:52.6956243Z ) -> None: 2025-05-07T20:32:52.6956494Z torch.manual_seed(2025) 2025-05-07T20:32:52.6956860Z 2025-05-07T20:32:52.6957179Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.6957572Z 2025-05-07T20:32:52.6957792Z x_sign = torch.sign(x) 2025-05-07T20:32:52.6958134Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.6958487Z x = x_sign * x_clamp 2025-05-07T20:32:52.6958761Z x0 = x[:, :D] 2025-05-07T20:32:52.6959015Z x1 = x[:, D:] 2025-05-07T20:32:52.6959259Z 2025-05-07T20:32:52.6959476Z if contiguous: 2025-05-07T20:32:52.6959748Z x0 = x0.contiguous() 2025-05-07T20:32:52.6960055Z x1 = x1.contiguous() 2025-05-07T20:32:52.6960336Z 2025-05-07T20:32:52.6960565Z if scale_ub is not None: 2025-05-07T20:32:52.6960883Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.6961313Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.6961673Z ) 2025-05-07T20:32:52.6961898Z else: 2025-05-07T20:32:52.6962150Z scale_ub_tensor = None 2025-05-07T20:32:52.6962436Z 2025-05-07T20:32:52.6962702Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.6963064Z op = silu_mul_quant 2025-05-07T20:32:52.6963394Z if compiled: 2025-05-07T20:32:52.6963679Z op = torch.compile(op) 2025-05-07T20:32:52.6964013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6964320Z 2025-05-07T20:32:52.6964544Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.6964735Z 2025-05-07T20:32:52.6964854Z moe/activation_test.py:117: 2025-05-07T20:32:52.6965187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6965572Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.6965905Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6966542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.6967182Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.6967926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.6968704Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.6969305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.6970073Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.6970821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.6971420Z kernel = self.compile( 2025-05-07T20:32:52.6972026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.6972769Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.6973226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6973486Z 2025-05-07T20:32:52.6973723Z self = 2025-05-07T20:32:52.6974936Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.6976522Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71a97e0>} 2025-05-07T20:32:52.6978029Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.6979183Z context = 2025-05-07T20:32:52.6979509Z 2025-05-07T20:32:52.6979786Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.6980377Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.6980911Z module_map=module_map) 2025-05-07T20:32:52.6981326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.6981725Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.6982025Z E ^ 2025-05-07T20:32:52.6982552Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.6983058Z 2025-05-07T20:32:52.6983525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.6984182Z 2025-05-07T20:32:52.9044283Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.9044941Z self=, 2025-05-07T20:32:52.9045558Z T=16384, 2025-05-07T20:32:52.9045969Z D=5120, 2025-05-07T20:32:52.9046289Z scale_ub=None, 2025-05-07T20:32:52.9046625Z contiguous=False, 2025-05-07T20:32:52.9046985Z compiled=True, 2025-05-07T20:32:52.9047402Z ) 2025-05-07T20:32:52.9047767Z self = 2025-05-07T20:32:52.9048334Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.9048656Z 2025-05-07T20:32:52.9048750Z @given( 2025-05-07T20:32:52.9049021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.9049379Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.9049736Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.9050127Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.9050508Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.9050841Z ) 2025-05-07T20:32:52.9051253Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.9051755Z def test_silu_mul_quant( 2025-05-07T20:32:52.9052039Z self, 2025-05-07T20:32:52.9052267Z T: int, 2025-05-07T20:32:52.9052503Z D: int, 2025-05-07T20:32:52.9052761Z scale_ub: Optional[float], 2025-05-07T20:32:52.9053081Z contiguous: bool, 2025-05-07T20:32:52.9053367Z compiled: bool, 2025-05-07T20:32:52.9053624Z ) -> None: 2025-05-07T20:32:52.9053877Z torch.manual_seed(2025) 2025-05-07T20:32:52.9054160Z 2025-05-07T20:32:52.9054472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.9054864Z 2025-05-07T20:32:52.9055092Z x_sign = torch.sign(x) 2025-05-07T20:32:52.9055429Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.9055841Z x = x_sign * x_clamp 2025-05-07T20:32:52.9056124Z x0 = x[:, :D] 2025-05-07T20:32:52.9056375Z x1 = x[:, D:] 2025-05-07T20:32:52.9056624Z 2025-05-07T20:32:52.9056849Z if contiguous: 2025-05-07T20:32:52.9057117Z x0 = x0.contiguous() 2025-05-07T20:32:52.9057423Z x1 = x1.contiguous() 2025-05-07T20:32:52.9057707Z 2025-05-07T20:32:52.9057930Z if scale_ub is not None: 2025-05-07T20:32:52.9058249Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.9058640Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.9058995Z ) 2025-05-07T20:32:52.9059225Z else: 2025-05-07T20:32:52.9059474Z scale_ub_tensor = None 2025-05-07T20:32:52.9059770Z 2025-05-07T20:32:52.9060038Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.9060402Z op = silu_mul_quant 2025-05-07T20:32:52.9060695Z if compiled: 2025-05-07T20:32:52.9060982Z op = torch.compile(op) 2025-05-07T20:32:52.9061326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.9061643Z 2025-05-07T20:32:52.9062001Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.9062200Z 2025-05-07T20:32:52.9062316Z moe/activation_test.py:117: 2025-05-07T20:32:52.9062664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.9063045Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.9063372Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.9064011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.9064652Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.9065401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.9066185Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.9066868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.9067643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.9068401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.9069011Z kernel = self.compile( 2025-05-07T20:32:52.9069684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.9070429Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.9070887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.9071149Z 2025-05-07T20:32:52.9071392Z self = 2025-05-07T20:32:52.9072615Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.9074262Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71aa680>} 2025-05-07T20:32:52.9075789Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.9076955Z context = 2025-05-07T20:32:52.9077283Z 2025-05-07T20:32:52.9077481Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.9078067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.9078607Z module_map=module_map) 2025-05-07T20:32:52.9079031Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.9079436Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.9079736Z E ^ 2025-05-07T20:32:52.9080269Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.9080780Z 2025-05-07T20:32:52.9081269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.9081856Z 2025-05-07T20:32:52.9081985Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.9082455Z self=, 2025-05-07T20:32:52.9082918Z T=2048, 2025-05-07T20:32:52.9083142Z D=5120, 2025-05-07T20:32:52.9083365Z scale_ub=None, 2025-05-07T20:32:52.9083624Z contiguous=False, 2025-05-07T20:32:52.9083895Z compiled=True, 2025-05-07T20:32:52.9084128Z ) 2025-05-07T20:32:53.0219930Z self = 2025-05-07T20:32:53.0220588Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:53.0221259Z 2025-05-07T20:32:53.0221417Z @given( 2025-05-07T20:32:53.0221830Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.0222335Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.0222707Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.0223094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.0223479Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.0224106Z ) 2025-05-07T20:32:53.0224519Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.0225029Z def test_silu_mul_quant( 2025-05-07T20:32:53.0225306Z self, 2025-05-07T20:32:53.0225559Z T: int, 2025-05-07T20:32:53.0225819Z D: int, 2025-05-07T20:32:53.0226163Z scale_ub: Optional[float], 2025-05-07T20:32:53.0226480Z contiguous: bool, 2025-05-07T20:32:53.0226764Z compiled: bool, 2025-05-07T20:32:53.0227021Z ) -> None: 2025-05-07T20:32:53.0227285Z torch.manual_seed(2025) 2025-05-07T20:32:53.0227571Z 2025-05-07T20:32:53.0227884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.0228282Z 2025-05-07T20:32:53.0228590Z x_sign = torch.sign(x) 2025-05-07T20:32:53.0228929Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.0229284Z x = x_sign * x_clamp 2025-05-07T20:32:53.0229566Z x0 = x[:, :D] 2025-05-07T20:32:53.0229824Z x1 = x[:, D:] 2025-05-07T20:32:53.0230065Z 2025-05-07T20:32:53.0230286Z if contiguous: 2025-05-07T20:32:53.0230563Z x0 = x0.contiguous() 2025-05-07T20:32:53.0230860Z x1 = x1.contiguous() 2025-05-07T20:32:53.0231143Z 2025-05-07T20:32:53.0231377Z if scale_ub is not None: 2025-05-07T20:32:53.0231700Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.0232096Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.0232463Z ) 2025-05-07T20:32:53.0232689Z else: 2025-05-07T20:32:53.0232941Z scale_ub_tensor = None 2025-05-07T20:32:53.0233237Z 2025-05-07T20:32:53.0233586Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.0233959Z op = silu_mul_quant 2025-05-07T20:32:53.0234259Z if compiled: 2025-05-07T20:32:53.0234556Z op = torch.compile(op) 2025-05-07T20:32:53.0234900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0235223Z 2025-05-07T20:32:53.0235457Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.0235650Z 2025-05-07T20:32:53.0235768Z moe/activation_test.py:117: 2025-05-07T20:32:53.0236118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0236509Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.0236835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0237486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.0238132Z return fn(*args, **kwargs) 
2025-05-07T20:32:53.0238896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.0239684Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.0240304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.0241090Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.0241843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.0242454Z kernel = self.compile( 2025-05-07T20:32:53.0243083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.0243843Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.0244436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0244710Z 2025-05-07T20:32:53.0244949Z self = 2025-05-07T20:32:53.0246181Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.0247753Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71aa560>} 2025-05-07T20:32:53.0249281Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.0250746Z context = 2025-05-07T20:32:53.0251135Z 2025-05-07T20:32:53.0251344Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.0252036Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.0252695Z module_map=module_map) 2025-05-07T20:32:53.0253163Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.0253613Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.0253936Z E ^ 2025-05-07T20:32:53.0254544Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tried the following nine examples. Every one of them failed with this same CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), raised from the same kernel launch in fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (reached via torch/_dynamo/eval_frame.py:678 when compiled=True), with source listings and tracebacks identical to the one above; only the example parameters differ:

2025-05-07T20:32:53.0256624Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:53.2373174Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:53.2410228Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:53.5595958Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:53.7104040Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:53.7180580Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:53.9301910Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:53.9363830Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:54.0534548Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
From this point the tried examples fail with CUDA out-of-memory errors while building their inputs. The @given block and test body are unchanged from the listing above, so only each example's failing statement and error message are shown:

2025-05-07T20:32:54.1391482Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:32:54.1427927Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:32:54.1453139Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:32:54.1476674Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.1500843Z 2025-05-07T20:32:54.1501064Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.1501462Z 2025-05-07T20:32:54.1501667Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.1502440Z self=, 2025-05-07T20:32:54.1503206Z T=2048, 2025-05-07T20:32:54.1503557Z D=7168, 2025-05-07T20:32:54.1503903Z scale_ub=None, 2025-05-07T20:32:54.1504301Z contiguous=True, 2025-05-07T20:32:54.1504717Z compiled=False, 2025-05-07T20:32:54.1505092Z ) 2025-05-07T20:32:54.2858448Z self = 2025-05-07T20:32:54.2859442Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2859953Z 2025-05-07T20:32:54.2860105Z @given( 2025-05-07T20:32:54.2860528Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2861117Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2861682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2862688Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2863269Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2863768Z ) 2025-05-07T20:32:54.2864374Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2865177Z def test_silu_mul_quant( 2025-05-07T20:32:54.2865622Z self, 2025-05-07T20:32:54.2865982Z T: int, 2025-05-07T20:32:54.2866334Z D: int, 2025-05-07T20:32:54.2866727Z scale_ub: Optional[float], 2025-05-07T20:32:54.2867220Z contiguous: bool, 2025-05-07T20:32:54.2867679Z compiled: bool, 2025-05-07T20:32:54.2868098Z ) -> None: 2025-05-07T20:32:54.2868479Z torch.manual_seed(2025) 2025-05-07T20:32:54.2868922Z 2025-05-07T20:32:54.2869423Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2870232Z 2025-05-07T20:32:54.2870585Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.2874500Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2878371Z 2025-05-07T20:32:54.2878607Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:54.2879026Z 2025-05-07T20:32:54.2879219Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2880016Z self=, 2025-05-07T20:32:54.2880797Z T=1, 2025-05-07T20:32:54.2881138Z D=7168, 2025-05-07T20:32:54.2881510Z scale_ub=1200.0, 2025-05-07T20:32:54.2881919Z contiguous=True, 2025-05-07T20:32:54.2882335Z compiled=False, 2025-05-07T20:32:54.2882722Z ) 2025-05-07T20:32:54.2883315Z self = 2025-05-07T20:32:54.2884215Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2884728Z 2025-05-07T20:32:54.2884873Z @given( 2025-05-07T20:32:54.2885306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2885880Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2886461Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2887092Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2887718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2888254Z ) 2025-05-07T20:32:54.2888917Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2889790Z def test_silu_mul_quant( 2025-05-07T20:32:54.2890239Z self, 2025-05-07T20:32:54.2890597Z T: int, 2025-05-07T20:32:54.2890981Z D: int, 2025-05-07T20:32:54.2891380Z scale_ub: Optional[float], 2025-05-07T20:32:54.2891897Z contiguous: bool, 2025-05-07T20:32:54.2892346Z compiled: bool, 2025-05-07T20:32:54.2892751Z ) -> None: 2025-05-07T20:32:54.2893157Z torch.manual_seed(2025) 2025-05-07T20:32:54.2893606Z 2025-05-07T20:32:54.2894091Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2894716Z 2025-05-07T20:32:54.2895044Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2895570Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2896142Z x = x_sign * x_clamp 2025-05-07T20:32:54.2896627Z x0 = x[:, :D] 2025-05-07T20:32:54.2897040Z x1 = x[:, D:] 2025-05-07T20:32:54.2897421Z 2025-05-07T20:32:54.2897765Z if contiguous: 2025-05-07T20:32:54.2898181Z x0 = x0.contiguous() 2025-05-07T20:32:54.2898646Z x1 = x1.contiguous() 2025-05-07T20:32:54.2899233Z 2025-05-07T20:32:54.2899597Z if scale_ub is not None: 2025-05-07T20:32:54.2900098Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2900719Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2901298Z ) 2025-05-07T20:32:54.2901670Z else: 2025-05-07T20:32:54.2902056Z scale_ub_tensor = None 2025-05-07T20:32:54.2902514Z 2025-05-07T20:32:54.2902976Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2903552Z op = silu_mul_quant 2025-05-07T20:32:54.2904005Z if compiled: 2025-05-07T20:32:54.2904430Z op = torch.compile(op) 2025-05-07T20:32:54.2904975Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2905499Z 2025-05-07T20:32:54.2905968Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2906294Z 2025-05-07T20:32:54.2906477Z moe/activation_test.py:117: 2025-05-07T20:32:54.2907025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2907635Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2908162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2909477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2910892Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2911896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2913211Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2914543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2915557Z kernel = self.compile( 2025-05-07T20:32:54.2916642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2917909Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2918657Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2919096Z 2025-05-07T20:32:54.2919483Z self = 2025-05-07T20:32:54.2921524Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2924551Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6e32cb0>} 2025-05-07T20:32:54.2927160Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2929119Z context = 2025-05-07T20:32:54.2929673Z 2025-05-07T20:32:54.2929980Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2930974Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2931891Z module_map=module_map) 2025-05-07T20:32:54.2932561Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2933234Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2933727Z E ^ 2025-05-07T20:32:54.2934626Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2935506Z 2025-05-07T20:32:54.2936354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2937355Z 2025-05-07T20:32:54.2937784Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2938579Z self=, 2025-05-07T20:32:54.2939340Z T=128, 2025-05-07T20:32:54.2939679Z D=5120, 2025-05-07T20:32:54.2940042Z scale_ub=None, 2025-05-07T20:32:54.2940443Z contiguous=True, 2025-05-07T20:32:54.2940856Z compiled=False, 2025-05-07T20:32:54.2941239Z ) 2025-05-07T20:32:54.3774797Z self = 2025-05-07T20:32:54.3775791Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.3776263Z 2025-05-07T20:32:54.3776425Z @given( 2025-05-07T20:32:54.3776833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3777385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3778276Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3778901Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3779534Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3780074Z ) 2025-05-07T20:32:54.3780748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3781589Z def test_silu_mul_quant( 2025-05-07T20:32:54.3782221Z self, 2025-05-07T20:32:54.3782587Z T: int, 2025-05-07T20:32:54.3782947Z D: int, 2025-05-07T20:32:54.3783356Z scale_ub: Optional[float], 2025-05-07T20:32:54.3783864Z contiguous: bool, 2025-05-07T20:32:54.3784305Z compiled: bool, 2025-05-07T20:32:54.3784724Z ) -> None: 2025-05-07T20:32:54.3785132Z torch.manual_seed(2025) 2025-05-07T20:32:54.3785591Z 2025-05-07T20:32:54.3786103Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3786773Z 2025-05-07T20:32:54.3787128Z x_sign = torch.sign(x) 2025-05-07T20:32:54.3787679Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.3788283Z x = x_sign * x_clamp 2025-05-07T20:32:54.3788747Z x0 = x[:, :D] 2025-05-07T20:32:54.3789151Z x1 = x[:, D:] 2025-05-07T20:32:54.3789558Z 2025-05-07T20:32:54.3789914Z if contiguous: 2025-05-07T20:32:54.3790356Z x0 = x0.contiguous() 2025-05-07T20:32:54.3790856Z x1 = x1.contiguous() 2025-05-07T20:32:54.3791322Z 2025-05-07T20:32:54.3791679Z if scale_ub is not None: 2025-05-07T20:32:54.3792208Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.3792851Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.3793441Z ) 2025-05-07T20:32:54.3793976Z else: 2025-05-07T20:32:54.3794386Z scale_ub_tensor = None 2025-05-07T20:32:54.3794862Z 2025-05-07T20:32:54.3795289Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.3795888Z op = silu_mul_quant 2025-05-07T20:32:54.3796354Z if compiled: 2025-05-07T20:32:54.3796822Z op = torch.compile(op) 2025-05-07T20:32:54.3797379Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.3797895Z 2025-05-07T20:32:54.3798252Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.3798579Z 2025-05-07T20:32:54.3798772Z moe/activation_test.py:117: 2025-05-07T20:32:54.3799336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.3799981Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.3800524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.3801863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.3803237Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.3804288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.3805647Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.3807231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.3808228Z kernel = self.compile( 2025-05-07T20:32:54.3809277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.3810554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.3811309Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.3811740Z 2025-05-07T20:32:54.3812116Z self = 2025-05-07T20:32:54.3814193Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.3817041Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6e33640>} 2025-05-07T20:32:54.3819661Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.3821736Z context = 2025-05-07T20:32:54.3822305Z 2025-05-07T20:32:54.3822614Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.3823605Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.3824897Z module_map=module_map) 2025-05-07T20:32:54.3825576Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.3826231Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.3826716Z E ^ 2025-05-07T20:32:54.3827599Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.3828497Z 2025-05-07T20:32:54.3829307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.3830321Z 2025-05-07T20:32:54.3830514Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3831300Z self=, 2025-05-07T20:32:54.3832058Z T=128, 2025-05-07T20:32:54.3832412Z D=7168, 2025-05-07T20:32:54.3832763Z scale_ub=None, 2025-05-07T20:32:54.3833154Z contiguous=True, 2025-05-07T20:32:54.3833635Z compiled=False, 2025-05-07T20:32:54.3834031Z ) 2025-05-07T20:32:54.3834624Z self = 2025-05-07T20:32:54.3835558Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.3836078Z 2025-05-07T20:32:54.3836223Z @given( 2025-05-07T20:32:54.3836656Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3837240Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3837818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3838448Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3839061Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3839593Z ) 2025-05-07T20:32:54.3840256Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3841092Z def test_silu_mul_quant( 2025-05-07T20:32:54.3841552Z self, 2025-05-07T20:32:54.3841912Z T: int, 2025-05-07T20:32:54.3842272Z D: int, 2025-05-07T20:32:54.3842681Z scale_ub: Optional[float], 2025-05-07T20:32:54.3843193Z contiguous: bool, 2025-05-07T20:32:54.3843638Z compiled: bool, 2025-05-07T20:32:54.3844041Z ) -> None: 2025-05-07T20:32:54.3844440Z torch.manual_seed(2025) 2025-05-07T20:32:54.3845110Z 2025-05-07T20:32:54.3845620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3846268Z 2025-05-07T20:32:54.3846633Z x_sign = torch.sign(x) 2025-05-07T20:32:54.3847173Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.3847764Z x = x_sign * x_clamp 2025-05-07T20:32:54.3848221Z x0 = x[:, :D] 2025-05-07T20:32:54.3848623Z x1 = x[:, D:] 2025-05-07T20:32:54.3849012Z 2025-05-07T20:32:54.3849360Z if contiguous: 2025-05-07T20:32:54.3849792Z x0 = x0.contiguous() 2025-05-07T20:32:54.3850282Z x1 = x1.contiguous() 2025-05-07T20:32:54.3850742Z 2025-05-07T20:32:54.3851096Z if scale_ub is not None: 2025-05-07T20:32:54.3851616Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.3852380Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.3852965Z ) 2025-05-07T20:32:54.3853319Z else: 2025-05-07T20:32:54.3853727Z scale_ub_tensor = None 2025-05-07T20:32:54.3854205Z 2025-05-07T20:32:54.3854634Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.3855240Z op = silu_mul_quant 2025-05-07T20:32:54.3855841Z if compiled: 2025-05-07T20:32:54.3856305Z op = torch.compile(op) 2025-05-07T20:32:54.3856924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.3857454Z 2025-05-07T20:32:54.3857808Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.3858132Z 2025-05-07T20:32:54.3858317Z moe/activation_test.py:117: 2025-05-07T20:32:54.3858880Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.3859504Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.3860038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.3861710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.3863220Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.3864326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.3865908Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.3878190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.3879288Z kernel = self.compile( 2025-05-07T20:32:54.3880350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.3881633Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.3882403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.3882857Z 2025-05-07T20:32:54.3883246Z self = 2025-05-07T20:32:54.3885380Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.3888117Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6ab4160>} 2025-05-07T20:32:54.3890763Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.3892674Z context = 2025-05-07T20:32:54.3893244Z 2025-05-07T20:32:54.3893549Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.3894684Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.3895581Z module_map=module_map) 2025-05-07T20:32:54.3896248Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.3896952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.3897443Z E ^ 2025-05-07T20:32:54.3898355Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.3899272Z 2025-05-07T20:32:54.3900095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.3901127Z 2025-05-07T20:32:54.3901323Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3902134Z self=, 2025-05-07T20:32:54.3902990Z T=2048, 2025-05-07T20:32:54.3903351Z D=7168, 2025-05-07T20:32:54.3903721Z scale_ub=1200.0, 2025-05-07T20:32:54.3904139Z contiguous=True, 2025-05-07T20:32:54.3904576Z compiled=False, 2025-05-07T20:32:54.3904973Z ) 2025-05-07T20:32:54.4895122Z self = 2025-05-07T20:32:54.4896075Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.4896869Z 2025-05-07T20:32:54.4897009Z @given( 2025-05-07T20:32:54.4897439Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4897992Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4898546Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4899145Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4899740Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4900261Z ) 2025-05-07T20:32:54.4900884Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4901721Z def test_silu_mul_quant( 2025-05-07T20:32:54.4902156Z self, 2025-05-07T20:32:54.4902498Z T: int, 2025-05-07T20:32:54.4902861Z D: int, 2025-05-07T20:32:54.4903253Z scale_ub: Optional[float], 2025-05-07T20:32:54.4903728Z contiguous: bool, 2025-05-07T20:32:54.4904153Z compiled: bool, 2025-05-07T20:32:54.4904562Z ) -> None: 2025-05-07T20:32:54.4904944Z torch.manual_seed(2025) 2025-05-07T20:32:54.4905378Z 2025-05-07T20:32:54.4905873Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4909803Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.4913403Z 2025-05-07T20:32:54.4913766Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.4914176Z 2025-05-07T20:32:54.4914361Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4915113Z self=, 2025-05-07T20:32:54.4915850Z T=1, 2025-05-07T20:32:54.4916177Z D=5120, 2025-05-07T20:32:54.4916515Z scale_ub=1200.0, 2025-05-07T20:32:54.4916905Z contiguous=True, 2025-05-07T20:32:54.4917310Z compiled=False, 2025-05-07T20:32:54.4917675Z ) 2025-05-07T20:32:54.4918250Z self = 2025-05-07T20:32:54.4919138Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.4919633Z 2025-05-07T20:32:54.4919773Z @given( 2025-05-07T20:32:54.4920182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4921011Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4921585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4922212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4922821Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4923366Z ) 2025-05-07T20:32:54.4924366Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4925196Z def test_silu_mul_quant( 2025-05-07T20:32:54.4925641Z self, 2025-05-07T20:32:54.4925984Z T: int, 2025-05-07T20:32:54.4926335Z D: int, 2025-05-07T20:32:54.4926722Z scale_ub: Optional[float], 2025-05-07T20:32:54.4927188Z contiguous: bool, 2025-05-07T20:32:54.4927623Z compiled: bool, 2025-05-07T20:32:54.4928021Z ) -> None: 2025-05-07T20:32:54.4928623Z torch.manual_seed(2025) 2025-05-07T20:32:54.4929067Z 2025-05-07T20:32:54.4929554Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4930158Z 2025-05-07T20:32:54.4930497Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4931011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4931558Z x = x_sign * x_clamp 2025-05-07T20:32:54.4932108Z x0 = x[:, :D] 2025-05-07T20:32:54.4932492Z x1 = x[:, D:] 2025-05-07T20:32:54.4932862Z 2025-05-07T20:32:54.4933188Z if contiguous: 2025-05-07T20:32:54.4933599Z x0 = x0.contiguous() 2025-05-07T20:32:54.4934066Z x1 = x1.contiguous() 2025-05-07T20:32:54.4934491Z 2025-05-07T20:32:54.4934832Z if scale_ub is not None: 2025-05-07T20:32:54.4935328Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4935933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4936496Z ) 2025-05-07T20:32:54.4936842Z else: 2025-05-07T20:32:54.4937206Z scale_ub_tensor = None 2025-05-07T20:32:54.4937650Z 2025-05-07T20:32:54.4938066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4938637Z op = silu_mul_quant 2025-05-07T20:32:54.4939085Z if compiled: 2025-05-07T20:32:54.4939529Z op = torch.compile(op) 2025-05-07T20:32:54.4940063Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4940557Z 2025-05-07T20:32:54.4940900Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.4941196Z 2025-05-07T20:32:54.4941381Z moe/activation_test.py:117: 2025-05-07T20:32:54.4941898Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4942506Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.4943017Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4944256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.4945533Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.4946520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4947779Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4948988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4950006Z kernel = self.compile( 2025-05-07T20:32:54.4950995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4952186Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.4952911Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4953338Z 2025-05-07T20:32:54.4953828Z self = 2025-05-07T20:32:54.4956023Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4958581Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6ab4940>} 2025-05-07T20:32:54.4961079Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4962971Z context = 2025-05-07T20:32:54.4963501Z 2025-05-07T20:32:54.4963810Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4964899Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4965755Z module_map=module_map) 2025-05-07T20:32:54.4966418Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4967050Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.4967511Z E ^ 2025-05-07T20:32:54.4968374Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4969295Z 2025-05-07T20:32:54.4970077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4971042Z 2025-05-07T20:32:54.4971242Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4971990Z self=, 2025-05-07T20:32:54.4972734Z T=2048, 2025-05-07T20:32:54.4973074Z D=5120, 2025-05-07T20:32:54.4973416Z scale_ub=None, 2025-05-07T20:32:54.4973808Z contiguous=True, 2025-05-07T20:32:54.4974217Z compiled=False, 2025-05-07T20:32:54.4974578Z ) 2025-05-07T20:32:54.4975163Z self = 2025-05-07T20:32:54.4976066Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.4976564Z 2025-05-07T20:32:54.4976714Z @given( 2025-05-07T20:32:54.4977124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4977702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4978256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4978847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4979439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4979958Z ) 2025-05-07T20:32:54.4980592Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4981408Z def test_silu_mul_quant( 2025-05-07T20:32:54.4981858Z self, 2025-05-07T20:32:54.4982206Z T: int, 2025-05-07T20:32:54.4982570Z D: int, 2025-05-07T20:32:54.4982960Z scale_ub: Optional[float], 2025-05-07T20:32:54.4983433Z contiguous: bool, 2025-05-07T20:32:54.4983860Z compiled: bool, 2025-05-07T20:32:54.4984266Z ) -> None: 2025-05-07T20:32:54.4984652Z torch.manual_seed(2025) 2025-05-07T20:32:54.4985085Z 2025-05-07T20:32:54.4985581Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4986210Z 2025-05-07T20:32:54.4986560Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.4990391Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
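Interleaved with the OOMs, every example that actually reaches the Triton kernel fails to compile instead: fp8e4nv is the OCP FP8 E4M3 format, which Triton only lowers natively on compute capability 8.9+ (Ada/Hopper), while the A10G in a g5 runner is SM 8.6 and therefore only offers fp8e4b15 and fp8e5. A guard along these lines (a hedged sketch, not the suite's actual skip logic) would skip the kernel on unsupported parts:

# Hypothetical guard sketch: fp8e4nv (FP8 E4M3) needs compute capability
# >= 8.9; an A10G (SM 8.6) rejects it, which matches the error above.
import pytest
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

requires_fp8 = pytest.mark.skipif(
    not supports_fp8e4nv(), reason="fp8e4nv requires SM 8.9+ (Ada/Hopper)"
)

Applied as @requires_fp8 on the test, this would turn the CompilationError cascade into clean skips on this runner class.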
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.4994045Z 2025-05-07T20:32:54.4994263Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:54.4994652Z 2025-05-07T20:32:54.4994846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4995603Z self=, 2025-05-07T20:32:54.4996356Z T=16384, 2025-05-07T20:32:54.4996718Z D=5120, 2025-05-07T20:32:54.4997062Z scale_ub=None, 2025-05-07T20:32:54.4997442Z contiguous=True, 2025-05-07T20:32:54.4997843Z compiled=False, 2025-05-07T20:32:54.4998216Z ) 2025-05-07T20:32:54.5980465Z self = 2025-05-07T20:32:54.5981428Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5981914Z 2025-05-07T20:32:54.5982353Z @given( 2025-05-07T20:32:54.5982761Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5983323Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5983882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5984487Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5985094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5985760Z ) 2025-05-07T20:32:54.5986398Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5987224Z def test_silu_mul_quant( 2025-05-07T20:32:54.5987663Z self, 2025-05-07T20:32:54.5988010Z T: int, 2025-05-07T20:32:54.5988358Z D: int, 2025-05-07T20:32:54.5988748Z scale_ub: Optional[float], 2025-05-07T20:32:54.5989231Z contiguous: bool, 2025-05-07T20:32:54.5989655Z compiled: bool, 2025-05-07T20:32:54.5990057Z ) -> None: 2025-05-07T20:32:54.5990452Z torch.manual_seed(2025) 2025-05-07T20:32:54.5990897Z 2025-05-07T20:32:54.5991391Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5995425Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
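Hypothesis itself is behaving as configured here: Verbosity.verbose prints a "Trying example:" header for every generated case, and deadline=None removes the per-example time budget (useful when CUDA kernel compilation is slow). A self-contained sketch of those knobs — max_examples=16 is a stand-in, since the real _MAX_SAMPLES value is not shown in this log:

# Hypothetical sketch of the Hypothesis settings used by the test above.
from hypothesis import Verbosity, given, settings
from hypothesis import strategies as st

@settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
def test_shapes(T: int) -> None:
    assert T > 0  # every drawn example is echoed to the log at this verbosity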
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5999027Z 2025-05-07T20:32:54.5999259Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5999644Z 2025-05-07T20:32:54.5999837Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6000577Z self=, 2025-05-07T20:32:54.6001309Z T=4096, 2025-05-07T20:32:54.6001644Z D=5120, 2025-05-07T20:32:54.6001975Z scale_ub=None, 2025-05-07T20:32:54.6002368Z contiguous=True, 2025-05-07T20:32:54.6002771Z compiled=False, 2025-05-07T20:32:54.6003139Z ) 2025-05-07T20:32:54.6003715Z self = 2025-05-07T20:32:54.6004636Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.6005139Z 2025-05-07T20:32:54.6005278Z @given( 2025-05-07T20:32:54.6005693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.6006317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.6006885Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.6007496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.6008098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.6008640Z ) 2025-05-07T20:32:54.6009294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.6010136Z def test_silu_mul_quant( 2025-05-07T20:32:54.6010588Z self, 2025-05-07T20:32:54.6011167Z T: int, 2025-05-07T20:32:54.6011529Z D: int, 2025-05-07T20:32:54.6011922Z scale_ub: Optional[float], 2025-05-07T20:32:54.6012394Z contiguous: bool, 2025-05-07T20:32:54.6012827Z compiled: bool, 2025-05-07T20:32:54.6013229Z ) -> None: 2025-05-07T20:32:54.6013611Z torch.manual_seed(2025) 2025-05-07T20:32:54.6014054Z 2025-05-07T20:32:54.6014547Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.6018194Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.6021720Z 2025-05-07T20:32:54.6021952Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.6022337Z 2025-05-07T20:32:54.6022603Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6023372Z self=, 2025-05-07T20:32:54.6024359Z T=2048, 2025-05-07T20:32:54.6024695Z D=5120, 2025-05-07T20:32:54.6025035Z scale_ub=None, 2025-05-07T20:32:54.6025429Z contiguous=False, 2025-05-07T20:32:54.6025823Z compiled=False, 2025-05-07T20:32:54.6026191Z ) 2025-05-07T20:32:54.6026755Z self = 2025-05-07T20:32:54.6027649Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.6028152Z 2025-05-07T20:32:54.6028289Z @given( 2025-05-07T20:32:54.6028696Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.6029272Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.6029827Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.6030442Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.6031047Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.6031548Z ) 2025-05-07T20:32:54.6032182Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.6032992Z def test_silu_mul_quant( 2025-05-07T20:32:54.6033426Z self, 2025-05-07T20:32:54.6033860Z T: int, 2025-05-07T20:32:54.6034209Z D: int, 2025-05-07T20:32:54.6034594Z scale_ub: Optional[float], 2025-05-07T20:32:54.6035076Z contiguous: bool, 2025-05-07T20:32:54.6035504Z compiled: bool, 2025-05-07T20:32:54.6035906Z ) -> None: 2025-05-07T20:32:54.6036330Z torch.manual_seed(2025) 2025-05-07T20:32:54.6036769Z 2025-05-07T20:32:54.6037261Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.6041030Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.6044515Z 2025-05-07T20:32:54.6044734Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.6045130Z 2025-05-07T20:32:54.6045315Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6046060Z self=, 2025-05-07T20:32:54.6046793Z T=4096, 2025-05-07T20:32:54.6047116Z D=7168, 2025-05-07T20:32:54.6047664Z scale_ub=None, 2025-05-07T20:32:54.6048065Z contiguous=True, 2025-05-07T20:32:54.6048451Z compiled=True, 2025-05-07T20:32:54.6048812Z ) 2025-05-07T20:32:54.6049379Z self = 2025-05-07T20:32:54.6050257Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.6050757Z 2025-05-07T20:32:54.6050895Z @given( 2025-05-07T20:32:54.6051302Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.6051852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.6052395Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.6052994Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.6053587Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.6054216Z ) 2025-05-07T20:32:54.6054854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.6055677Z def test_silu_mul_quant( 2025-05-07T20:32:54.6056105Z self, 2025-05-07T20:32:54.6056455Z T: int, 2025-05-07T20:32:54.6056827Z D: int, 2025-05-07T20:32:54.6057253Z scale_ub: Optional[float], 2025-05-07T20:32:54.6057884Z contiguous: bool, 2025-05-07T20:32:54.6058326Z compiled: bool, 2025-05-07T20:32:54.6058723Z ) -> None: 2025-05-07T20:32:54.6059116Z torch.manual_seed(2025) 2025-05-07T20:32:54.6059568Z 2025-05-07T20:32:54.6060052Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.6063967Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
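Note the monotonic creep in "allocated by PyTorch" across examples (21.61 GiB, then 21.67 GiB, then 21.73 GiB): memory from earlier examples is evidently still held when the next one starts. If those references really are dead, a cleanup pass between examples would return the cached blocks — a sketch under that assumption; it cannot free tensors that are still alive:

# Hypothetical per-example cleanup sketch for the growth seen above.
import gc
import torch

def release_cuda_memory() -> None:
    gc.collect()              # drop dead Python references first
    torch.cuda.empty_cache()  # then return cached blocks to the driver
    torch.cuda.synchronize()  # ensure pending frees have actually landed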
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.6067489Z 2025-05-07T20:32:54.6067711Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.6068111Z 2025-05-07T20:32:54.6068297Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6069055Z self=, 2025-05-07T20:32:54.6069785Z T=2048, 2025-05-07T20:32:54.6070131Z D=5120, 2025-05-07T20:32:54.6070488Z scale_ub=1200.0, 2025-05-07T20:32:54.6070873Z contiguous=False, 2025-05-07T20:32:54.6071278Z compiled=False, 2025-05-07T20:32:54.6071631Z ) 2025-05-07T20:32:54.6072204Z self = 2025-05-07T20:32:54.6073115Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.6073724Z 2025-05-07T20:32:54.6073866Z @given( 2025-05-07T20:32:54.6074293Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.6074858Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.6075421Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.6076032Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.6076678Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.6077205Z ) 2025-05-07T20:32:54.6077851Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.6078677Z def test_silu_mul_quant( 2025-05-07T20:32:54.6079109Z self, 2025-05-07T20:32:54.6079462Z T: int, 2025-05-07T20:32:54.6079819Z D: int, 2025-05-07T20:32:54.6080209Z scale_ub: Optional[float], 2025-05-07T20:32:54.6080703Z contiguous: bool, 2025-05-07T20:32:54.6081149Z compiled: bool, 2025-05-07T20:32:54.6081545Z ) -> None: 2025-05-07T20:32:54.6081929Z torch.manual_seed(2025) 2025-05-07T20:32:54.6082372Z 2025-05-07T20:32:54.6082994Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.6086942Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.6090516Z 2025-05-07T20:32:54.6090730Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.6091199Z 2025-05-07T20:32:54.6091388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6092151Z self=, 2025-05-07T20:32:54.6092891Z T=4096, 2025-05-07T20:32:54.6093231Z D=7168, 2025-05-07T20:32:54.6093581Z scale_ub=1200.0, 2025-05-07T20:32:54.6093961Z contiguous=True, 2025-05-07T20:32:54.6094364Z compiled=False, 2025-05-07T20:32:54.6094811Z ) 2025-05-07T20:32:54.7449452Z self = 2025-05-07T20:32:54.7450441Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.7450946Z 2025-05-07T20:32:54.7451095Z @given( 2025-05-07T20:32:54.7451521Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.7452092Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.7452653Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.7453251Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.7453837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.7454329Z ) 2025-05-07T20:32:54.7454939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.7455759Z def test_silu_mul_quant( 2025-05-07T20:32:54.7456179Z self, 2025-05-07T20:32:54.7456517Z T: int, 2025-05-07T20:32:54.7456902Z D: int, 2025-05-07T20:32:54.7457352Z scale_ub: Optional[float], 2025-05-07T20:32:54.7457867Z contiguous: bool, 2025-05-07T20:32:54.7458324Z compiled: bool, 2025-05-07T20:32:54.7458752Z ) -> None: 2025-05-07T20:32:54.7459155Z torch.manual_seed(2025) 2025-05-07T20:32:54.7459619Z 2025-05-07T20:32:54.7460130Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.7464109Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.7467877Z 2025-05-07T20:32:54.7468119Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.7468525Z 2025-05-07T20:32:54.7468721Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7469522Z self=, 2025-05-07T20:32:54.7470307Z T=16384, 2025-05-07T20:32:54.7470670Z D=7168, 2025-05-07T20:32:54.7471032Z scale_ub=None, 2025-05-07T20:32:54.7471446Z contiguous=False, 2025-05-07T20:32:54.7471865Z compiled=True, 2025-05-07T20:32:54.7472264Z ) 2025-05-07T20:32:54.7472884Z self = 2025-05-07T20:32:54.7474008Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.7475041Z 2025-05-07T20:32:54.7475202Z @given( 2025-05-07T20:32:54.7475641Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.7476243Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.7476831Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.7477511Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.7478154Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.7478703Z ) 2025-05-07T20:32:54.7479381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.7480274Z def test_silu_mul_quant( 2025-05-07T20:32:54.7480737Z self, 2025-05-07T20:32:54.7481101Z T: int, 2025-05-07T20:32:54.7481476Z D: int, 2025-05-07T20:32:54.7482124Z scale_ub: Optional[float], 2025-05-07T20:32:54.7482725Z contiguous: bool, 2025-05-07T20:32:54.7495245Z compiled: bool, 2025-05-07T20:32:54.7495713Z ) -> None: 2025-05-07T20:32:54.7496142Z torch.manual_seed(2025) 2025-05-07T20:32:54.7496623Z 2025-05-07T20:32:54.7497135Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.7501093Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
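The requested sizes are consistent with the input tensor alone: a [T, 2*D] bfloat16 tensor at T=16384, D=7168 is exactly the 448.00 MiB the allocator reports, and the 112.00 MiB and 40.00 MiB requests match T=4096, D=7168 and T=2048, D=5120 the same way:

# Sanity check of the 448.00 MiB figure reported above.
T, D = 16384, 7168
bytes_per_bf16 = 2
size_mib = T * (2 * D) * bytes_per_bf16 / 2**20
print(size_mib)  # 448.0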
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.7504927Z 2025-05-07T20:32:54.7505147Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.7505531Z 2025-05-07T20:32:54.7505721Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7506463Z self=, 2025-05-07T20:32:54.7507266Z T=4096, 2025-05-07T20:32:54.7507621Z D=7168, 2025-05-07T20:32:54.7507985Z scale_ub=None, 2025-05-07T20:32:54.7508395Z contiguous=True, 2025-05-07T20:32:54.7508828Z compiled=False, 2025-05-07T20:32:54.7509221Z ) 2025-05-07T20:32:54.7509818Z self = 2025-05-07T20:32:54.7510770Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.7511303Z 2025-05-07T20:32:54.7511459Z @given( 2025-05-07T20:32:54.7511893Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.7512482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.7513076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.7513806Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.7514439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.7515005Z ) 2025-05-07T20:32:54.7515689Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.7516562Z def test_silu_mul_quant( 2025-05-07T20:32:54.7517054Z self, 2025-05-07T20:32:54.7517436Z T: int, 2025-05-07T20:32:54.7517812Z D: int, 2025-05-07T20:32:54.7518229Z scale_ub: Optional[float], 2025-05-07T20:32:54.7518763Z contiguous: bool, 2025-05-07T20:32:54.7519220Z compiled: bool, 2025-05-07T20:32:54.7519645Z ) -> None: 2025-05-07T20:32:54.7520056Z torch.manual_seed(2025) 2025-05-07T20:32:54.7520521Z 2025-05-07T20:32:54.7521028Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.7525770Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.7529412Z 2025-05-07T20:32:54.7529647Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.7530050Z 2025-05-07T20:32:54.7530254Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7531034Z self=, 2025-05-07T20:32:54.7531806Z T=16384, 2025-05-07T20:32:54.7532176Z D=7168, 2025-05-07T20:32:54.7532536Z scale_ub=None, 2025-05-07T20:32:54.7532937Z contiguous=True, 2025-05-07T20:32:54.7533509Z compiled=False, 2025-05-07T20:32:54.7533906Z ) 2025-05-07T20:32:54.7534496Z self = 2025-05-07T20:32:54.7535452Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.7535994Z 2025-05-07T20:32:54.7536153Z @given( 2025-05-07T20:32:54.7536580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.7537321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.7537907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.7538529Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.7539160Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.7539710Z ) 2025-05-07T20:32:54.7540380Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.7541227Z def test_silu_mul_quant( 2025-05-07T20:32:54.7541700Z self, 2025-05-07T20:32:54.7542083Z T: int, 2025-05-07T20:32:54.7542452Z D: int, 2025-05-07T20:32:54.7542865Z scale_ub: Optional[float], 2025-05-07T20:32:54.7543381Z contiguous: bool, 2025-05-07T20:32:54.7543842Z compiled: bool, 2025-05-07T20:32:54.7544271Z ) -> None: 2025-05-07T20:32:54.7544675Z torch.manual_seed(2025) 2025-05-07T20:32:54.7545133Z 2025-05-07T20:32:54.7545650Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.7549690Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.7553317Z 2025-05-07T20:32:54.7553621Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.7554035Z 2025-05-07T20:32:54.7554253Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7555032Z self=, 2025-05-07T20:32:54.7555801Z T=16384, 2025-05-07T20:32:54.7556184Z D=7168, 2025-05-07T20:32:54.7556540Z scale_ub=1200.0, 2025-05-07T20:32:54.7556989Z contiguous=True, 2025-05-07T20:32:54.7557444Z compiled=False, 2025-05-07T20:32:54.7557828Z ) 2025-05-07T20:32:54.7558425Z self = 2025-05-07T20:32:54.7559386Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.7559915Z 2025-05-07T20:32:54.7560075Z @given( 2025-05-07T20:32:54.7560499Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.7561111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.7561703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.7562330Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.7563113Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.7563679Z ) 2025-05-07T20:32:54.7564345Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.7565220Z def test_silu_mul_quant( 2025-05-07T20:32:54.7565683Z self, 2025-05-07T20:32:54.7566048Z T: int, 2025-05-07T20:32:54.7566426Z D: int, 2025-05-07T20:32:54.7566846Z scale_ub: Optional[float], 2025-05-07T20:32:54.7567356Z contiguous: bool, 2025-05-07T20:32:54.7567817Z compiled: bool, 2025-05-07T20:32:54.7568246Z ) -> None: 2025-05-07T20:32:54.7568660Z torch.manual_seed(2025) 2025-05-07T20:32:54.7569121Z 2025-05-07T20:32:54.7569646Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.7573731Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.7577399Z 2025-05-07T20:32:54.7577646Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.7578065Z 2025-05-07T20:32:54.7578264Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7579051Z self=, 2025-05-07T20:32:54.7579829Z T=128, 2025-05-07T20:32:54.7580189Z D=5120, 2025-05-07T20:32:54.7580560Z scale_ub=1200.0, 2025-05-07T20:32:54.7580994Z contiguous=False, 2025-05-07T20:32:54.7581433Z compiled=False, 2025-05-07T20:32:54.7581821Z ) 2025-05-07T20:32:55.0994110Z self = 2025-05-07T20:32:55.0995048Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.0995521Z 2025-05-07T20:32:55.0995658Z @given( 2025-05-07T20:32:55.0996081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.0996600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.0997123Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.0997703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.0998288Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.0998771Z ) 2025-05-07T20:32:55.0999293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.0999960Z def test_silu_mul_quant( 2025-05-07T20:32:55.1000338Z self, 2025-05-07T20:32:55.1000633Z T: int, 2025-05-07T20:32:55.1000943Z D: int, 2025-05-07T20:32:55.1001287Z scale_ub: Optional[float], 2025-05-07T20:32:55.1001719Z contiguous: bool, 2025-05-07T20:32:55.1002111Z compiled: bool, 2025-05-07T20:32:55.1002487Z ) -> None: 2025-05-07T20:32:55.1002835Z torch.manual_seed(2025) 2025-05-07T20:32:55.1003221Z 2025-05-07T20:32:55.1003645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1004184Z 2025-05-07T20:32:55.1004489Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1004980Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1005483Z x = x_sign * x_clamp 2025-05-07T20:32:55.1005898Z x0 = x[:, :D] 2025-05-07T20:32:55.1006283Z x1 = x[:, D:] 2025-05-07T20:32:55.1006648Z 2025-05-07T20:32:55.1006980Z if contiguous: 2025-05-07T20:32:55.1007397Z x0 = x0.contiguous() 2025-05-07T20:32:55.1007865Z x1 = x1.contiguous() 2025-05-07T20:32:55.1008298Z 2025-05-07T20:32:55.1008642Z if scale_ub is not None: 2025-05-07T20:32:55.1009570Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.1010188Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.1010744Z ) 2025-05-07T20:32:55.1011091Z else: 2025-05-07T20:32:55.1011467Z scale_ub_tensor = None 2025-05-07T20:32:55.1011913Z 2025-05-07T20:32:55.1012325Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.1012877Z op = silu_mul_quant 2025-05-07T20:32:55.1013330Z if compiled: 2025-05-07T20:32:55.1013779Z op = torch.compile(op) 2025-05-07T20:32:55.1014283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1014733Z 2025-05-07T20:32:55.1015053Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.1015341Z 2025-05-07T20:32:55.1015668Z moe/activation_test.py:117: 2025-05-07T20:32:55.1016204Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1016788Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.1017285Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1018527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.1019894Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.1020868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.1022104Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.1023279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.1024686Z kernel = self.compile( 2025-05-07T20:32:55.1025695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.1026948Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.1027672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1028119Z 2025-05-07T20:32:55.1028488Z self = 2025-05-07T20:32:55.1030453Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.1032995Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6948940>} 2025-05-07T20:32:55.1035579Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.1037539Z context = 2025-05-07T20:32:55.1038067Z 2025-05-07T20:32:55.1038364Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.1039323Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.1040152Z module_map=module_map) 2025-05-07T20:32:55.1040787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.1041388Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.1041812Z E ^ 2025-05-07T20:32:55.1042579Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.1043426Z 2025-05-07T20:32:55.1044141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.1045023Z 2025-05-07T20:32:55.1045220Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1046161Z self=, 2025-05-07T20:32:55.1046867Z T=2048, 2025-05-07T20:32:55.1047193Z D=7168, 2025-05-07T20:32:55.1047515Z scale_ub=None, 2025-05-07T20:32:55.1047892Z contiguous=False, 2025-05-07T20:32:55.1048287Z compiled=False, 2025-05-07T20:32:55.1048655Z ) 2025-05-07T20:32:55.1049198Z self = 2025-05-07T20:32:55.1050061Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.1050542Z 2025-05-07T20:32:55.1050687Z @given( 2025-05-07T20:32:55.1051079Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1051643Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1052176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1052859Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1053431Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1053938Z ) 2025-05-07T20:32:55.1054554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1055356Z def test_silu_mul_quant( 2025-05-07T20:32:55.1055781Z self, 2025-05-07T20:32:55.1056121Z T: int, 2025-05-07T20:32:55.1056617Z D: int, 2025-05-07T20:32:55.1056998Z scale_ub: Optional[float], 2025-05-07T20:32:55.1057469Z contiguous: bool, 2025-05-07T20:32:55.1057887Z compiled: bool, 2025-05-07T20:32:55.1058278Z ) -> None: 2025-05-07T20:32:55.1058656Z torch.manual_seed(2025) 2025-05-07T20:32:55.1059085Z 2025-05-07T20:32:55.1059543Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1063290Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
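[Note] The fp8e4nv CompilationError above is Triton rejecting the float8 e4m3 element type at kernel-compile time: on NVIDIA targets fp8e4nv is only lowered for GPUs of compute capability 8.9 or newer, and this runner's GPU evidently predates that, since it only offers 'fp8e4b15' and 'fp8e5'. A hedged guard along these lines would skip the fp8 path instead of failing; the helper name is illustrative, not from the log:

# Sketch: skip fp8e4nv-dependent tests on GPUs that cannot compile the kernels.
# Assumption: fp8e4nv (float8 e4m3) requires compute capability >= (8, 9),
# which is consistent with the ValueError printed above.
import unittest
import torch

def supports_fp8e4nv() -> bool:  # illustrative helper
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(supports_fp8e4nv(), "GPU lacks fp8e4nv (e4m3) support")
class Fp8ActivationTests(unittest.TestCase):
    ...  # fp8 test cases would live here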
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.1066699Z 2025-05-07T20:32:55.1066949Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:55.1067342Z 2025-05-07T20:32:55.1067530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1068253Z self=, 2025-05-07T20:32:55.1068970Z T=128, 2025-05-07T20:32:55.1069296Z D=7168, 2025-05-07T20:32:55.1069624Z scale_ub=1200.0, 2025-05-07T20:32:55.1070007Z contiguous=True, 2025-05-07T20:32:55.1070401Z compiled=True, 2025-05-07T20:32:55.1070762Z ) 2025-05-07T20:32:55.1490857Z self = 2025-05-07T20:32:55.1491780Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.1492238Z 2025-05-07T20:32:55.1492387Z @given( 2025-05-07T20:32:55.1492744Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1493227Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1493729Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1494305Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1494881Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1495380Z ) 2025-05-07T20:32:55.1496017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1496750Z def test_silu_mul_quant( 2025-05-07T20:32:55.1497141Z self, 2025-05-07T20:32:55.1497450Z T: int, 2025-05-07T20:32:55.1497762Z D: int, 2025-05-07T20:32:55.1498106Z scale_ub: Optional[float], 2025-05-07T20:32:55.1498549Z contiguous: bool, 2025-05-07T20:32:55.1498953Z compiled: bool, 2025-05-07T20:32:55.1499622Z ) -> None: 2025-05-07T20:32:55.1499989Z torch.manual_seed(2025) 2025-05-07T20:32:55.1500391Z 2025-05-07T20:32:55.1500863Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1501488Z 2025-05-07T20:32:55.1501793Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1502224Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1502692Z x = x_sign * x_clamp 2025-05-07T20:32:55.1503059Z x0 = x[:, :D] 2025-05-07T20:32:55.1503378Z x1 = x[:, D:] 2025-05-07T20:32:55.1503706Z 2025-05-07T20:32:55.1503994Z if contiguous: 2025-05-07T20:32:55.1504353Z x0 = x0.contiguous() 2025-05-07T20:32:55.1504772Z x1 = x1.contiguous() 2025-05-07T20:32:55.1505172Z 2025-05-07T20:32:55.1505605Z if scale_ub is not None: 2025-05-07T20:32:55.1506042Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.1506582Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.1507081Z ) 2025-05-07T20:32:55.1507381Z else: 2025-05-07T20:32:55.1507699Z scale_ub_tensor = None 2025-05-07T20:32:55.1508071Z 2025-05-07T20:32:55.1508420Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.1508997Z op = silu_mul_quant 2025-05-07T20:32:55.1509371Z if compiled: 2025-05-07T20:32:55.1509756Z op = torch.compile(op) 2025-05-07T20:32:55.1510204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1510629Z 2025-05-07T20:32:55.1510908Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.1511160Z 2025-05-07T20:32:55.1511309Z moe/activation_test.py:117: 2025-05-07T20:32:55.1511749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1512246Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.1512669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1513690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.1514599Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.1515675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.1516812Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.1517675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.1518779Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.1519864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.1520734Z kernel = self.compile( 2025-05-07T20:32:55.1521602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.1522678Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.1523313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1523677Z 2025-05-07T20:32:55.1524563Z self = 2025-05-07T20:32:55.1526348Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.1528669Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6948dc0>} 2025-05-07T20:32:55.1530905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.1532809Z context = 2025-05-07T20:32:55.1533292Z 2025-05-07T20:32:55.1533552Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.1534401Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.1535153Z module_map=module_map) 2025-05-07T20:32:55.1535711Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.1536247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.1536644Z E ^ 2025-05-07T20:32:55.1537387Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.1538145Z 2025-05-07T20:32:55.1538848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.1539797Z 2025-05-07T20:32:55.1539965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1540606Z self=, 2025-05-07T20:32:55.1541262Z T=128, 2025-05-07T20:32:55.1541562Z D=7168, 2025-05-07T20:32:55.1541870Z scale_ub=1200.0, 2025-05-07T20:32:55.1542342Z contiguous=True, 2025-05-07T20:32:55.1542699Z compiled=False, 2025-05-07T20:32:55.1543004Z ) 2025-05-07T20:32:55.1543493Z self = 2025-05-07T20:32:55.1544258Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:55.1544690Z 2025-05-07T20:32:55.1544822Z @given( 2025-05-07T20:32:55.1545181Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1545693Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1546235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1546810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1547355Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1547799Z ) 2025-05-07T20:32:55.1548337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1549069Z def test_silu_mul_quant( 2025-05-07T20:32:55.1549489Z self, 2025-05-07T20:32:55.1549810Z T: int, 2025-05-07T20:32:55.1550131Z D: int, 2025-05-07T20:32:55.1550496Z scale_ub: Optional[float], 2025-05-07T20:32:55.1550932Z contiguous: bool, 2025-05-07T20:32:55.1551290Z compiled: bool, 2025-05-07T20:32:55.1551633Z ) -> None: 2025-05-07T20:32:55.1551960Z torch.manual_seed(2025) 2025-05-07T20:32:55.1552337Z 2025-05-07T20:32:55.1552764Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1553318Z 2025-05-07T20:32:55.1553704Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1554179Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1557506Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
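[Note] Note the pattern across the examples above: after the first large allocation fails, every later example fails on progressively smaller requests (here 20.00 MiB against 4.44 MiB free) because roughly 21.7 GiB is still held by PyTorch from earlier examples. One mitigation sketch is to release cached blocks between examples; wiring this into the test class's teardown is an assumption, not something the log shows:

# Sketch: return cached CUDA memory between tests so one OOMing example
# does not starve the rest of the run.
import gc
import torch

def release_cuda_memory() -> None:
    gc.collect()              # drop Python references to dead tensors first
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver

# e.g. call release_cuda_memory() from unittest.TestCase.tearDown()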
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.1560472Z 2025-05-07T20:32:55.1560657Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:55.1560979Z 2025-05-07T20:32:55.1561142Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1561757Z self=, 2025-05-07T20:32:55.1562381Z T=128, 2025-05-07T20:32:55.1562662Z D=5120, 2025-05-07T20:32:55.1562942Z scale_ub=1200.0, 2025-05-07T20:32:55.1563275Z contiguous=True, 2025-05-07T20:32:55.1563720Z compiled=True, 2025-05-07T20:32:55.1564031Z ) 2025-05-07T20:32:55.1564511Z self = 2025-05-07T20:32:55.1565260Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.1565681Z 2025-05-07T20:32:55.1565807Z @given( 2025-05-07T20:32:55.1566147Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1566621Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1567090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1567585Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1568099Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1568540Z ) 2025-05-07T20:32:55.1569074Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1569843Z def test_silu_mul_quant( 2025-05-07T20:32:55.1570216Z self, 2025-05-07T20:32:55.1570515Z T: int, 2025-05-07T20:32:55.1570821Z D: int, 2025-05-07T20:32:55.1571155Z scale_ub: Optional[float], 2025-05-07T20:32:55.1571561Z contiguous: bool, 2025-05-07T20:32:55.1571929Z compiled: bool, 2025-05-07T20:32:55.1572336Z ) -> None: 2025-05-07T20:32:55.1572657Z torch.manual_seed(2025) 2025-05-07T20:32:55.1573051Z 2025-05-07T20:32:55.1573463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1573995Z 2025-05-07T20:32:55.1574283Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1574727Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1578187Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.1581402Z 2025-05-07T20:32:55.1581645Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:55.1582052Z 2025-05-07T20:32:55.1582244Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1582997Z self=, 2025-05-07T20:32:55.1593464Z T=128, 2025-05-07T20:32:55.1593829Z D=7168, 2025-05-07T20:32:55.1594127Z scale_ub=None, 2025-05-07T20:32:55.1594465Z contiguous=True, 2025-05-07T20:32:55.1594811Z compiled=True, 2025-05-07T20:32:55.1595119Z ) 2025-05-07T20:32:55.4002495Z self = 2025-05-07T20:32:55.4003412Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:55.4003853Z 2025-05-07T20:32:55.4004024Z @given( 2025-05-07T20:32:55.4004396Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4004879Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4005393Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4005934Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4006518Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4007010Z ) 2025-05-07T20:32:55.4007579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4008303Z def test_silu_mul_quant( 2025-05-07T20:32:55.4008738Z self, 2025-05-07T20:32:55.4009063Z T: int, 2025-05-07T20:32:55.4009391Z D: int, 2025-05-07T20:32:55.4009756Z scale_ub: Optional[float], 2025-05-07T20:32:55.4010251Z contiguous: bool, 2025-05-07T20:32:55.4010675Z compiled: bool, 2025-05-07T20:32:55.4011074Z ) -> None: 2025-05-07T20:32:55.4011864Z torch.manual_seed(2025) 2025-05-07T20:32:55.4012305Z 2025-05-07T20:32:55.4012777Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4016564Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
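[Note] For scale: every parameter in the @given block is drawn from a small fixed list, so the full example space is 5 x 2 x 2 x 2 x 2 = 80 combinations, and @settings(verbosity=Verbosity.verbose, ...) is what makes Hypothesis print each "Trying example" block seen throughout this log. A quick check of that count:

# Sketch: size of the sampled parameter grid in test_silu_mul_quant.
from itertools import product

grid = list(product(
    [1, 128, 2048, 4096, 16384],  # T
    [5120, 7168],                 # D
    [None, 1200.00],              # scale_ub
    [True, False],                # contiguous
    [True, False],                # compiled
))
print(len(grid))  # 80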
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.4019949Z 2025-05-07T20:32:55.4020300Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:55.4020667Z 2025-05-07T20:32:55.4047048Z FAILED 2025-05-07T20:32:55.4047248Z 2025-05-07T20:32:55.4047460Z =================================== FAILURES =================================== 2025-05-07T20:32:55.4048024Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:55.4048543Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:55.4049409Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:55.4050071Z | yield 2025-05-07T20:32:55.4050602Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:32:55.4051213Z | self._callTestMethod(testMethod) 2025-05-07T20:32:55.4051854Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:32:55.4052521Z | method() 2025-05-07T20:32:55.4053251Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:55.4054077Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4054796Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:55.4055558Z | raise the_error_hypothesis_found 2025-05-07T20:32:55.4056130Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:55.4056706Z +-+---------------- 1 ---------------- 2025-05-07T20:32:55.4057066Z | Traceback (most recent call last): 2025-05-07T20:32:55.4057909Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:55.4058824Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4061219Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.4063509Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:55.4064022Z | self=, 2025-05-07T20:32:55.4064511Z | T=2048, 2025-05-07T20:32:55.4064795Z | D=5120, # or any other generated value 2025-05-07T20:32:55.4065187Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:55.4065622Z | contiguous=True, # or any other generated value 2025-05-07T20:32:55.4066065Z | compiled=False, # or any other generated value 2025-05-07T20:32:55.4066593Z | ) 2025-05-07T20:32:55.4066855Z | 2025-05-07T20:32:55.4067477Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:55.4068225Z +---------------- 2 ---------------- 2025-05-07T20:32:55.4068537Z | Traceback (most recent call last): 2025-05-07T20:32:55.4069387Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:55.4070303Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4072778Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.4076530Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:55.4077273Z | self=, 2025-05-07T20:32:55.4077983Z | T=128, 2025-05-07T20:32:55.4078319Z | D=7168, 2025-05-07T20:32:55.4078661Z | scale_ub=None, 2025-05-07T20:32:55.4079054Z | contiguous=True, 2025-05-07T20:32:55.4079438Z | compiled=True, 2025-05-07T20:32:55.4079790Z | ) 2025-05-07T20:32:55.4080079Z | 2025-05-07T20:32:55.4080771Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:55.4081431Z +---------------- 3 ---------------- 2025-05-07T20:32:55.4081746Z | Traceback (most recent call last): 2025-05-07T20:32:55.4082495Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:55.4084290Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4086440Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.4088507Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:55.4088975Z | self=, 2025-05-07T20:32:55.4089415Z | T=128, 2025-05-07T20:32:55.4089636Z | D=5120, 2025-05-07T20:32:55.4089866Z | scale_ub=1200.0, 2025-05-07T20:32:55.4090132Z | contiguous=True, 2025-05-07T20:32:55.4090391Z | compiled=True, 2025-05-07T20:32:55.4090631Z | ) 2025-05-07T20:32:55.4090840Z | 2025-05-07T20:32:55.4091526Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:55.4092173Z +---------------- 4 ---------------- 2025-05-07T20:32:55.4092479Z | Traceback (most recent call last): 2025-05-07T20:32:55.4093226Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:55.4093988Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:55.4094803Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:55.4095541Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4096425Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:55.4097263Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.4097900Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:55.4098807Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4099941Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:55.4101070Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4102253Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:55.4103496Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4104649Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:55.4105676Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.4106673Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:55.4107508Z | fn() 2025-05-07T20:32:55.4108358Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:55.4109272Z | self.fn.run( 2025-05-07T20:32:55.4110047Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:55.4110926Z | kernel = self.compile( 2025-05-07T20:32:55.4111867Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:55.4112918Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4114094Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:55.4115252Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4116026Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4116552Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.4116945Z | ^ 2025-05-07T20:32:55.4117634Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4118476Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:55.4119077Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:55.4119836Z | self=, 2025-05-07T20:32:55.4120486Z | T=1, # or any other generated value 2025-05-07T20:32:55.4120958Z | D=5120, # or any other generated value 2025-05-07T20:32:55.4121448Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:55.4121990Z | contiguous=True, # or any other generated value 2025-05-07T20:32:55.4122535Z | compiled=True, # or any other generated value 2025-05-07T20:32:55.4123004Z | ) 2025-05-07T20:32:55.4123278Z | 2025-05-07T20:32:55.4124565Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:55.4125453Z +------------------------------------ 2025-05-07T20:32:55.4125964Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:55.4126518Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4127127Z self=, 2025-05-07T20:32:55.4127710Z T=1, 2025-05-07T20:32:55.4127990Z D=5120, 2025-05-07T20:32:55.4128293Z scale_ub=None, 2025-05-07T20:32:55.4128612Z contiguous=True, 2025-05-07T20:32:55.4128949Z compiled=True, 2025-05-07T20:32:55.4129273Z ) 2025-05-07T20:32:55.4129741Z self = 2025-05-07T20:32:55.4130560Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:55.4130934Z 2025-05-07T20:32:55.4131050Z @given( 2025-05-07T20:32:55.4131395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4131857Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4132330Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4132812Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4133373Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4133802Z ) 2025-05-07T20:32:55.4134329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4134992Z def test_silu_mul_quant( 2025-05-07T20:32:55.4135347Z self, 2025-05-07T20:32:55.4135639Z T: int, 2025-05-07T20:32:55.4135937Z D: int, 2025-05-07T20:32:55.4136267Z scale_ub: Optional[float], 2025-05-07T20:32:55.4136695Z contiguous: bool, 2025-05-07T20:32:55.4137057Z compiled: bool, 2025-05-07T20:32:55.4137385Z ) -> None: 2025-05-07T20:32:55.4137709Z torch.manual_seed(2025) 2025-05-07T20:32:55.4138068Z 2025-05-07T20:32:55.4138466Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4138981Z 2025-05-07T20:32:55.4139277Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4139705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4140179Z x = x_sign * x_clamp 2025-05-07T20:32:55.4140553Z x0 = x[:, :D] 2025-05-07T20:32:55.4140885Z x1 = x[:, D:] 2025-05-07T20:32:55.4141210Z 2025-05-07T20:32:55.4141499Z if contiguous: 2025-05-07T20:32:55.4141848Z x0 = x0.contiguous() 
2025-05-07T20:32:55.4142249Z x1 = x1.contiguous() 2025-05-07T20:32:55.4142622Z 2025-05-07T20:32:55.4142911Z if scale_ub is not None: 2025-05-07T20:32:55.4143300Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4143783Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4144223Z ) 2025-05-07T20:32:55.4144504Z else: 2025-05-07T20:32:55.4144816Z scale_ub_tensor = None 2025-05-07T20:32:55.4145175Z 2025-05-07T20:32:55.4145497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4145938Z op = silu_mul_quant 2025-05-07T20:32:55.4146340Z if compiled: 2025-05-07T20:32:55.4146696Z op = torch.compile(op) 2025-05-07T20:32:55.4147114Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4147512Z 2025-05-07T20:32:55.4147782Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.4148191Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.4148603Z 2025-05-07T20:32:55.4148942Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4149420Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.4149862Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.4150345Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.4150873Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4151432Z 2025-05-07T20:32:55.4151729Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:55.4152008Z 2025-05-07T20:32:55.4152154Z moe/activation_test.py:126: 2025-05-07T20:32:55.4152577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4153060Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.4153642Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4154780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.4155884Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.4156648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4157648Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4158654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.4159688Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4160731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.4161820Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4162832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.4163729Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.4164567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.4165289Z fn() 2025-05-07T20:32:55.4165998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.4166818Z self.fn.run( 2025-05-07T20:32:55.4167469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4168221Z kernel = self.compile( 2025-05-07T20:32:55.4168978Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4169907Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4170480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4170810Z 2025-05-07T20:32:55.4171102Z self = 2025-05-07T20:32:55.4172676Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4174637Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff1bc73400>} 2025-05-07T20:32:55.4176508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4177994Z context = 2025-05-07T20:32:55.4178420Z 2025-05-07T20:32:55.4178662Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4179420Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4180099Z module_map=module_map) 2025-05-07T20:32:55.4180627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4181144Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.4181629Z E ^ 2025-05-07T20:32:55.4182303Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4182964Z 2025-05-07T20:32:55.4183568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4184279Z 2025-05-07T20:32:55.4184440Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4185021Z self=, 2025-05-07T20:32:55.4185583Z T=2048, 2025-05-07T20:32:55.4185859Z D=5120, 2025-05-07T20:32:55.4186132Z scale_ub=1200.0, 2025-05-07T20:32:55.4186458Z contiguous=True, 2025-05-07T20:32:55.4186781Z compiled=False, 2025-05-07T20:32:55.4187127Z ) 2025-05-07T20:32:55.4187577Z self = 2025-05-07T20:32:55.4188281Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:55.4188678Z 2025-05-07T20:32:55.4188803Z @given( 2025-05-07T20:32:55.4189147Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4189624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4190146Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4190638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4191146Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4191581Z ) 2025-05-07T20:32:55.4192103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4192759Z def test_silu_mul_quant( 2025-05-07T20:32:55.4193108Z self, 2025-05-07T20:32:55.4193387Z T: int, 2025-05-07T20:32:55.4193762Z D: int, 2025-05-07T20:32:55.4194086Z scale_ub: Optional[float], 2025-05-07T20:32:55.4194471Z contiguous: bool, 2025-05-07T20:32:55.4194803Z compiled: bool, 2025-05-07T20:32:55.4195122Z ) -> None: 2025-05-07T20:32:55.4195444Z torch.manual_seed(2025) 2025-05-07T20:32:55.4195787Z 2025-05-07T20:32:55.4196179Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4196664Z 2025-05-07T20:32:55.4196945Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4197355Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4197795Z x = x_sign * x_clamp 2025-05-07T20:32:55.4198137Z x0 = x[:, :D] 
2025-05-07T20:32:55.4198455Z x1 = x[:, D:] 2025-05-07T20:32:55.4198754Z 2025-05-07T20:32:55.4199026Z if contiguous: 2025-05-07T20:32:55.4199368Z x0 = x0.contiguous() 2025-05-07T20:32:55.4199757Z x1 = x1.contiguous() 2025-05-07T20:32:55.4200112Z 2025-05-07T20:32:55.4200391Z if scale_ub is not None: 2025-05-07T20:32:55.4200787Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4201272Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4201730Z ) 2025-05-07T20:32:55.4202025Z else: 2025-05-07T20:32:55.4202338Z scale_ub_tensor = None 2025-05-07T20:32:55.4202707Z 2025-05-07T20:32:55.4203041Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4203504Z op = silu_mul_quant 2025-05-07T20:32:55.4203864Z if compiled: 2025-05-07T20:32:55.4204234Z op = torch.compile(op) 2025-05-07T20:32:55.4204682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4205074Z 2025-05-07T20:32:55.4205358Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4205595Z 2025-05-07T20:32:55.4205747Z moe/activation_test.py:117: 2025-05-07T20:32:55.4206178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4206652Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4207055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4208085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4209030Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4209786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4210756Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4211697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4212458Z kernel = self.compile( 2025-05-07T20:32:55.4213233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4214188Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4214824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4215173Z 2025-05-07T20:32:55.4215487Z self = 2025-05-07T20:32:55.4217093Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4219196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff1b452ef0>} 2025-05-07T20:32:55.4221130Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4222600Z context = 2025-05-07T20:32:55.4223031Z 2025-05-07T20:32:55.4223278Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4224300Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4225000Z module_map=module_map) 2025-05-07T20:32:55.4225527Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4244273Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4244728Z E ^ 2025-05-07T20:32:55.4245430Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4246113Z 2025-05-07T20:32:55.4246728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4247507Z 2025-05-07T20:32:55.4247667Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4248283Z self=, 2025-05-07T20:32:55.4248865Z T=2048, 2025-05-07T20:32:55.4249140Z D=5120, 2025-05-07T20:32:55.4249432Z scale_ub=1200.0, 2025-05-07T20:32:55.4249781Z contiguous=True, 2025-05-07T20:32:55.4250120Z compiled=True, 2025-05-07T20:32:55.4250436Z ) 2025-05-07T20:32:55.4250922Z self = 2025-05-07T20:32:55.4251667Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.4252083Z 2025-05-07T20:32:55.4252202Z @given( 2025-05-07T20:32:55.4252556Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4253031Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4253499Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4254003Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4254512Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4254939Z ) 2025-05-07T20:32:55.4255438Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4256079Z def test_silu_mul_quant( 2025-05-07T20:32:55.4256723Z self, 2025-05-07T20:32:55.4257031Z T: int, 2025-05-07T20:32:55.4257330Z D: int, 2025-05-07T20:32:55.4257649Z scale_ub: Optional[float], 2025-05-07T20:32:55.4258061Z contiguous: bool, 2025-05-07T20:32:55.4258426Z compiled: bool, 2025-05-07T20:32:55.4258755Z ) -> None: 2025-05-07T20:32:55.4259080Z torch.manual_seed(2025) 2025-05-07T20:32:55.4259452Z 2025-05-07T20:32:55.4259825Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4260313Z 2025-05-07T20:32:55.4260589Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4260997Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4261432Z x = x_sign * x_clamp 2025-05-07T20:32:55.4261782Z x0 = x[:, :D] 2025-05-07T20:32:55.4262211Z x1 = x[:, D:] 2025-05-07T20:32:55.4262528Z 2025-05-07T20:32:55.4262810Z if contiguous: 2025-05-07T20:32:55.4263151Z x0 = x0.contiguous() 2025-05-07T20:32:55.4263535Z x1 = x1.contiguous() 2025-05-07T20:32:55.4263899Z 2025-05-07T20:32:55.4264187Z if scale_ub is not None: 2025-05-07T20:32:55.4264590Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4265187Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4265636Z ) 2025-05-07T20:32:55.4265925Z else: 2025-05-07T20:32:55.4266247Z scale_ub_tensor = None 2025-05-07T20:32:55.4266631Z 2025-05-07T20:32:55.4266972Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4267450Z op = silu_mul_quant 2025-05-07T20:32:55.4267832Z if compiled: 2025-05-07T20:32:55.4268208Z op = torch.compile(op) 2025-05-07T20:32:55.4268652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4269073Z 2025-05-07T20:32:55.4269368Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.4269800Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.4270239Z 2025-05-07T20:32:55.4270526Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4270884Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.4271207Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.4271545Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.4271924Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4272263Z 2025-05-07T20:32:55.4272481Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:55.4272689Z 2025-05-07T20:32:55.4272801Z moe/activation_test.py:126: 2025-05-07T20:32:55.4273116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4273477Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.4273967Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4274802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.4275599Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.4276178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4276902Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4277623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.4278391Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4279192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.4279988Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4280859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.4281542Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.4282177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.4282721Z fn() 2025-05-07T20:32:55.4283259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.4283873Z self.fn.run( 2025-05-07T20:32:55.4284371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4284928Z kernel = self.compile( 2025-05-07T20:32:55.4285500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4286234Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4286653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4286902Z 2025-05-07T20:32:55.4287123Z self = 2025-05-07T20:32:55.4288271Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4289776Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7eff09f211b0>} 2025-05-07T20:32:55.4291191Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4292272Z context = 2025-05-07T20:32:55.4292584Z 2025-05-07T20:32:55.4292767Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4293325Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4293828Z module_map=module_map) 2025-05-07T20:32:55.4294212Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4294593Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.4294878Z E ^ 2025-05-07T20:32:55.4295367Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4295848Z 2025-05-07T20:32:55.4296289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4296835Z 2025-05-07T20:32:55.4296947Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4297388Z self=, 2025-05-07T20:32:55.4297814Z T=16384, 2025-05-07T20:32:55.4298024Z D=7168, 2025-05-07T20:32:55.4298235Z scale_ub=1200.0, 2025-05-07T20:32:55.4298474Z contiguous=False, 2025-05-07T20:32:55.4298717Z compiled=False, 2025-05-07T20:32:55.4298940Z ) 2025-05-07T20:32:55.4299276Z self = 2025-05-07T20:32:55.4299813Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.4300109Z 2025-05-07T20:32:55.4300198Z @given( 2025-05-07T20:32:55.4300449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4300778Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4301105Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4301457Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4301811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4302118Z ) 2025-05-07T20:32:55.4302579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4303048Z def test_silu_mul_quant( 2025-05-07T20:32:55.4303310Z self, 2025-05-07T20:32:55.4303523Z T: int, 2025-05-07T20:32:55.4303732Z D: int, 2025-05-07T20:32:55.4303969Z scale_ub: Optional[float], 2025-05-07T20:32:55.4304259Z contiguous: bool, 2025-05-07T20:32:55.4304512Z compiled: bool, 2025-05-07T20:32:55.4304754Z ) -> None: 2025-05-07T20:32:55.4304984Z torch.manual_seed(2025) 2025-05-07T20:32:55.4305237Z 2025-05-07T20:32:55.4305529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4305895Z 2025-05-07T20:32:55.4306106Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4306414Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4306793Z x = x_sign * x_clamp 2025-05-07T20:32:55.4307054Z x0 = x[:, :D] 2025-05-07T20:32:55.4307281Z x1 = x[:, D:] 2025-05-07T20:32:55.4307512Z 2025-05-07T20:32:55.4307714Z if contiguous: 2025-05-07T20:32:55.4307959Z x0 = x0.contiguous() 2025-05-07T20:32:55.4308235Z x1 = x1.contiguous() 2025-05-07T20:32:55.4308495Z 2025-05-07T20:32:55.4308743Z if scale_ub is not None: 2025-05-07T20:32:55.4309040Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4309398Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4309725Z ) 2025-05-07T20:32:55.4309935Z else: 2025-05-07T20:32:55.4310164Z scale_ub_tensor = None 2025-05-07T20:32:55.4310429Z 2025-05-07T20:32:55.4310681Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4311018Z op = silu_mul_quant 2025-05-07T20:32:55.4311290Z if compiled: 
2025-05-07T20:32:55.4311553Z op = torch.compile(op) 2025-05-07T20:32:55.4311871Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4312171Z 2025-05-07T20:32:55.4312380Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4312562Z 2025-05-07T20:32:55.4312668Z moe/activation_test.py:117: 2025-05-07T20:32:55.4312989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4313344Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4313744Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4314483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4315217Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4315781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4316507Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4317213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4317779Z kernel = self.compile( 2025-05-07T20:32:55.4318355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4319051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4319477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4319726Z 2025-05-07T20:32:55.4319946Z self = 2025-05-07T20:32:55.4321092Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4322556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff09f20af0>} 2025-05-07T20:32:55.4324467Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4325570Z context = 2025-05-07T20:32:55.4325878Z 2025-05-07T20:32:55.4326056Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4326661Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4327160Z module_map=module_map) 2025-05-07T20:32:55.4327543Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4327919Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4328345Z E ^ 2025-05-07T20:32:55.4328836Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.4329314Z 
2025-05-07T20:32:55.4329759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.4330307Z 
Hypothesis, running with verbosity=Verbosity.verbose, re-prints the full test source and a near-identical Triton traceback for every sampled example, and every example dies at the same point: ast_to_ttir rejecting the fp8e4nv cast during src.make_ir. The intermediate examples are therefore condensed to their sampled parameters and failing frame; the full source listing and traceback are retained once, for the final example (T=16384) at the end of this section.

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
  fn() returned; ref_fn() raised at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid]: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
  fn() raised at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant[grid]: same CompilationError.
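Every failure above and below has the same root cause: both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row cast to Triton's fp8e4nv (FP8 E4M3), and Triton only lowers that dtype on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The A10G on this g5 runner reports capability (8, 6), where Triton offers only fp8e4b15 and fp8e5, hence the ValueError at IR-construction time. A minimal sketch of the capability check; the (8, 9) threshold is our reading of Triton's NVIDIA backend support, not an FBGEMM API:

import torch

def cuda_supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) needs SM 8.9+; on SM 8.6 (A10G) Triton exposes only
    # fp8e4b15 and fp8e5, which matches the ValueError in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)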
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
  fn() raised at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant[grid]: same CompilationError.

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
  fn() returned; ref_fn() raised at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]: same CompilationError.
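Rather than letting Hypothesis re-raise the identical error for each sampled example, the test could be skipped up front on unsupported hardware. A sketch using pytest; the marker name and placement are hypothetical, not the suite's actual gating:

import pytest
import torch

# Skip FP8 E4M3 tests on GPUs where Triton cannot lower fp8e4nv.
requires_fp8e4nv = pytest.mark.skipif(
    not torch.cuda.is_available()
    or torch.cuda.get_device_capability() < (8, 9),
    reason="Triton fp8e4nv (FP8 E4M3) requires SM 8.9+ (Ada/Hopper)",
)

Applied as @requires_fp8e4nv on test_silu_mul_quant, the property test would never reach the Triton compile path on this runner, and the job would report a skip instead of dozens of identical failures.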
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
  fn() raised at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant[grid]: same CompilationError.

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  fn() raised at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant[grid]: same CompilationError. Passing a scale_ub tensor makes no difference; the kernel fails at compile time, before launch.
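For context on what is being tested: ref_fn computes the SiLU-gated product y = x0 * sigmoid(x0) * x1 in fp32, then row-quantizes it with triton_quantize_fp8_row. A rough PyTorch-only stand-in for that quantization, assuming row-wise max-abs scaling into E4M3 range (max magnitude 448) and the dequantization convention the test itself uses, y_fp8.to(torch.float32) * y_scale[:, None]; the clamping constants are illustrative, not FBGEMM's exact kernel behavior:

from typing import Optional, Tuple
import torch

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    FP8_MAX = 448.0  # max magnitude representable in float8_e4m3fn
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale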
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fn() returned; ref_fn() raised at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]: same CompilationError.

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fn() returned; ref_fn() raised at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]: same CompilationError.
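The failure is independent of the FBGEMM kernels themselves: on this GPU, any Triton kernel that casts to tl.float8e4nv should hit the same error during compilation. A hypothetical standalone reproducer (kernel name and launch config are ours, not from the log):

import torch
import triton
import triton.language as tl

@triton.jit
def _fp8_cast_probe(x_ptr, y_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    # This cast is what ast_to_ttir rejects on SM < 8.9:
    y = x.to(tl.float8e4nv).to(tl.float32)
    tl.store(y_ptr + offs, y)

x = torch.randn(16, device="cuda")
y = torch.empty_like(x)
# Expected on an A10G: triton.compiler.errors.CompilationError wrapping
# ValueError("type fp8e4nv not supported in this architecture. ...")
_fp8_cast_probe[(1,)](x, y, BLOCK=16)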
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fn() returned; ref_fn() raised at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]: same CompilationError.

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fn() returned; ref_fn() raised at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]: same CompilationError.
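Alternatively, a kernel can pick a supported fp8 format at runtime. Purely illustrative: E5M2 carries fewer mantissa bits than E4M3, so this changes numerics and is not a drop-in fix for these tests:

import torch
import triton.language as tl

def pick_fp8_dtype():
    # Prefer E4M3 (fp8e4nv) where supported; fall back to E5M2 (fp8e5),
    # one of the two dtypes the error message lists for this architecture.
    if torch.cuda.get_device_capability() >= (8, 9):
        return tl.float8e4nv
    return tl.float8e5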
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4632893Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7eff0888f5b0>} 2025-05-07T20:32:55.4633738Z module_map = {'triton.language.extra.libdevice': <module ...>} 2025-05-07T20:32:55.4633945Z context = <...> 2025-05-07T20:32:55.4633950Z 2025-05-07T20:32:55.4634129Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4634404Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4634525Z module_map=module_map) 2025-05-07T20:32:55.4634697Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4634806Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.4634896Z E ^ 2025-05-07T20:32:55.4635269Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4635278Z 2025-05-07T20:32:55.4635712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
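The failure is environmental rather than a logic bug: both kernels in play (_kernel_quantize_fp8_row on the reference path and _fbgemm_silu_mul_quant in the op under test) cast to fp8e4nv, Triton's name for torch.float8_e4m3fn, and on this Triton build that dtype appears to require an NVIDIA GPU of compute capability 8.9 or newer. The linux.g5.4xlarge runner carries an A10G (SM 8.6), which is why compilation stops with the ValueError above before any assertion runs. A minimal sketch of a capability guard that would skip these tests on such GPUs (the helper name and the test-class wiring are illustrative, not FBGEMM's actual code):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn; Triton compiles it only
        # for sufficiently new NVIDIA GPUs (assumed here: SM >= 8.9, i.e. Ada
        # or Hopper). The A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical wiring; the real test class lives in moe/activation_test.py.
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        pass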
2025-05-07T20:32:55.4640900Z ) 2025-05-07T20:32:55.4640987Z else: 2025-05-07T20:32:55.4641088Z scale_ub_tensor = None 2025-05-07T20:32:55.4641166Z 2025-05-07T20:32:55.4641317Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4641413Z op = silu_mul_quant 2025-05-07T20:32:55.4641505Z if compiled: 2025-05-07T20:32:55.4641618Z op = torch.compile(op) 2025-05-07T20:32:55.4641735Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4641820Z 2025-05-07T20:32:55.4641917Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.4642046Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.4642134Z 2025-05-07T20:32:55.4642277Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4642384Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.4642507Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.4642636Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.4642786Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4642872Z 2025-05-07T20:32:55.4642987Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:55.4642992Z 2025-05-07T20:32:55.4643106Z moe/activation_test.py:126: 2025-05-07T20:32:55.4643240Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4643356Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.4643506Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4644091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.4644200Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.4644586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4644821Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4645218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.4645575Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4645998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.4646283Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4646673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.4646856Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.4647214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.4647298Z fn() 2025-05-07T20:32:55.4647762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.4647850Z self.fn.run( 2025-05-07T20:32:55.4648213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4648319Z kernel = self.compile( 2025-05-07T20:32:55.4648717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4648949Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4649085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:55.4649089Z 2025-05-07T20:32:55.4649306Z self = 2025-05-07T20:32:55.4650126Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4650662Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff0949ec20>} 2025-05-07T20:32:55.4651446Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4651653Z context = 2025-05-07T20:32:55.4651658Z 2025-05-07T20:32:55.4651834Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4652116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4652233Z module_map=module_map) 2025-05-07T20:32:55.4652409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4652521Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.4652603Z E ^ 2025-05-07T20:32:55.4652984Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4652988Z 2025-05-07T20:32:55.4653421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4653428Z 2025-05-07T20:32:55.4653546Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4653779Z self=, 2025-05-07T20:32:55.4653861Z T=1, 2025-05-07T20:32:55.4653948Z D=5120, 2025-05-07T20:32:55.4654038Z scale_ub=1200.0, 2025-05-07T20:32:55.4654128Z contiguous=True, 2025-05-07T20:32:55.4654222Z compiled=True, 2025-05-07T20:32:55.4654301Z ) 2025-05-07T20:32:55.4654528Z self = 2025-05-07T20:32:55.4654711Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.4654716Z 2025-05-07T20:32:55.4654800Z @given( 2025-05-07T20:32:55.4655016Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4655126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4655250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4655384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4655506Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4655585Z ) 2025-05-07T20:32:55.4655851Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4655952Z def test_silu_mul_quant( 2025-05-07T20:32:55.4656034Z self, 2025-05-07T20:32:55.4656126Z T: int, 2025-05-07T20:32:55.4656207Z D: int, 2025-05-07T20:32:55.4656317Z scale_ub: Optional[float], 2025-05-07T20:32:55.4656455Z contiguous: bool, 2025-05-07T20:32:55.4656548Z compiled: bool, 2025-05-07T20:32:55.4656638Z ) -> None: 2025-05-07T20:32:55.4656739Z torch.manual_seed(2025) 2025-05-07T20:32:55.4656823Z 2025-05-07T20:32:55.4657006Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4657085Z 2025-05-07T20:32:55.4657185Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4657368Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4657461Z x = x_sign * x_clamp 2025-05-07T20:32:55.4657547Z x0 = x[:, :D] 2025-05-07T20:32:55.4657638Z x1 = x[:, D:] 2025-05-07T20:32:55.4657718Z 2025-05-07T20:32:55.4657813Z if contiguous: 2025-05-07T20:32:55.4657917Z x0 = x0.contiguous() 2025-05-07T20:32:55.4658011Z x1 = x1.contiguous() 2025-05-07T20:32:55.4658096Z 2025-05-07T20:32:55.4658191Z if scale_ub is not None: 2025-05-07T20:32:55.4658304Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:55.4658455Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4658536Z ) 2025-05-07T20:32:55.4658617Z else: 2025-05-07T20:32:55.4658730Z scale_ub_tensor = None 2025-05-07T20:32:55.4658808Z 2025-05-07T20:32:55.4658947Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4659051Z op = silu_mul_quant 2025-05-07T20:32:55.4659145Z if compiled: 2025-05-07T20:32:55.4659251Z op = torch.compile(op) 2025-05-07T20:32:55.4659370Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4659448Z 2025-05-07T20:32:55.4659552Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4659556Z 2025-05-07T20:32:55.4659665Z moe/activation_test.py:117: 2025-05-07T20:32:55.4659803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4659916Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4660025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4660413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.4660524Z return fn(*args, **kwargs) 2025-05-07T20:32:55.4661042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4661156Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4661532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4661769Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4662134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4662236Z kernel = self.compile( 2025-05-07T20:32:55.4662635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4662830Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4663047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4663052Z 2025-05-07T20:32:55.4663277Z self = 2025-05-07T20:32:55.4664082Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4664620Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08d2ac20>} 2025-05-07T20:32:55.4665396Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4665641Z context = 2025-05-07T20:32:55.4665645Z 2025-05-07T20:32:55.4665835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4666114Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4666273Z module_map=module_map) 2025-05-07T20:32:55.4666448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4666554Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4666643Z E ^ 2025-05-07T20:32:55.4667015Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4667020Z 2025-05-07T20:32:55.4667451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4667466Z 2025-05-07T20:32:55.4667576Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4667812Z self=, 2025-05-07T20:32:55.4667907Z T=1, 2025-05-07T20:32:55.4667990Z D=5120, 2025-05-07T20:32:55.4668080Z scale_ub=None, 2025-05-07T20:32:55.4668178Z contiguous=False, 2025-05-07T20:32:55.4668266Z compiled=True, 2025-05-07T20:32:55.4668348Z ) 2025-05-07T20:32:55.4668584Z self = 2025-05-07T20:32:55.4668759Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.4668764Z 2025-05-07T20:32:55.4668850Z @given( 2025-05-07T20:32:55.4668985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4669091Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4669219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4669343Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4669471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4669558Z ) 2025-05-07T20:32:55.4669820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4669920Z def test_silu_mul_quant( 2025-05-07T20:32:55.4670008Z self, 2025-05-07T20:32:55.4670092Z T: int, 2025-05-07T20:32:55.4670177Z D: int, 2025-05-07T20:32:55.4670289Z scale_ub: Optional[float], 2025-05-07T20:32:55.4670384Z contiguous: bool, 2025-05-07T20:32:55.4670483Z compiled: bool, 2025-05-07T20:32:55.4670567Z ) -> None: 2025-05-07T20:32:55.4670670Z torch.manual_seed(2025) 2025-05-07T20:32:55.4670755Z 2025-05-07T20:32:55.4670934Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4671012Z 2025-05-07T20:32:55.4671117Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4671249Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4671346Z x = x_sign * x_clamp 2025-05-07T20:32:55.4671443Z x0 = x[:, :D] 2025-05-07T20:32:55.4671528Z x1 = x[:, D:] 2025-05-07T20:32:55.4671606Z 2025-05-07T20:32:55.4671812Z if contiguous: 2025-05-07T20:32:55.4671912Z x0 = x0.contiguous() 2025-05-07T20:32:55.4672009Z x1 = x1.contiguous() 2025-05-07T20:32:55.4672095Z 2025-05-07T20:32:55.4672196Z if scale_ub is not None: 2025-05-07T20:32:55.4672316Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4672458Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4672539Z ) 2025-05-07T20:32:55.4672630Z else: 2025-05-07T20:32:55.4672731Z scale_ub_tensor = None 2025-05-07T20:32:55.4672810Z 2025-05-07T20:32:55.4672952Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4673048Z op = silu_mul_quant 2025-05-07T20:32:55.4673138Z if compiled: 2025-05-07T20:32:55.4673293Z op = torch.compile(op) 2025-05-07T20:32:55.4673406Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4673487Z 2025-05-07T20:32:55.4673726Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.4673856Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.4673939Z 2025-05-07T20:32:55.4674084Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4674825Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.4674938Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.4675069Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.4675219Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4675301Z 2025-05-07T20:32:55.4675408Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:55.4675413Z 2025-05-07T20:32:55.4675521Z moe/activation_test.py:126: 2025-05-07T20:32:55.4675666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4675782Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.4675931Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4676524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.4676635Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.4677023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4677263Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4677653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.4677920Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4678338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.4678619Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4679012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.4679190Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.4679558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.4679642Z fn() 2025-05-07T20:32:55.4680068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.4680157Z self.fn.run( 2025-05-07T20:32:55.4680510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4680617Z kernel = self.compile( 2025-05-07T20:32:55.4681018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4681288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4681435Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4681440Z 2025-05-07T20:32:55.4681656Z self = 2025-05-07T20:32:55.4682478Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4683011Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7eff0837f370>} 2025-05-07T20:32:55.4683794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4684045Z context = 2025-05-07T20:32:55.4684049Z 2025-05-07T20:32:55.4684228Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4684557Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4684673Z module_map=module_map) 2025-05-07T20:32:55.4684855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4684967Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.4685049Z E ^ 2025-05-07T20:32:55.4685430Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4685435Z 2025-05-07T20:32:55.4685871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4685875Z 2025-05-07T20:32:55.4685996Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4686231Z self=, 2025-05-07T20:32:55.4686313Z T=1, 2025-05-07T20:32:55.4686401Z D=5120, 2025-05-07T20:32:55.4686491Z scale_ub=None, 2025-05-07T20:32:55.4686585Z contiguous=True, 2025-05-07T20:32:55.4686681Z compiled=False, 2025-05-07T20:32:55.4686762Z ) 2025-05-07T20:32:55.4686990Z self = 2025-05-07T20:32:55.4687170Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:55.4687175Z 2025-05-07T20:32:55.4687257Z @given( 2025-05-07T20:32:55.4687382Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4687497Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4687624Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4687754Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4687878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4687957Z ) 2025-05-07T20:32:55.4688222Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4688321Z def test_silu_mul_quant( 2025-05-07T20:32:55.4688406Z self, 2025-05-07T20:32:55.4688494Z T: int, 2025-05-07T20:32:55.4688577Z D: int, 2025-05-07T20:32:55.4688682Z scale_ub: Optional[float], 2025-05-07T20:32:55.4688784Z contiguous: bool, 2025-05-07T20:32:55.4688877Z compiled: bool, 2025-05-07T20:32:55.4688961Z ) -> None: 2025-05-07T20:32:55.4689067Z torch.manual_seed(2025) 2025-05-07T20:32:55.4689148Z 2025-05-07T20:32:55.4689332Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4689411Z 2025-05-07T20:32:55.4689513Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4689654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4689748Z x = x_sign * x_clamp 2025-05-07T20:32:55.4689917Z x0 = x[:, :D] 2025-05-07T20:32:55.4690011Z x1 = x[:, D:] 2025-05-07T20:32:55.4690089Z 2025-05-07T20:32:55.4690180Z if contiguous: 2025-05-07T20:32:55.4690287Z x0 = x0.contiguous() 2025-05-07T20:32:55.4690384Z x1 = x1.contiguous() 2025-05-07T20:32:55.4690460Z 2025-05-07T20:32:55.4690562Z if scale_ub is not None: 2025-05-07T20:32:55.4690675Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4690825Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4690906Z ) 2025-05-07T20:32:55.4690986Z else: 2025-05-07T20:32:55.4691094Z scale_ub_tensor = None 2025-05-07T20:32:55.4691172Z 2025-05-07T20:32:55.4691309Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4691476Z op = silu_mul_quant 2025-05-07T20:32:55.4691566Z if compiled: 2025-05-07T20:32:55.4691672Z 
op = torch.compile(op) 2025-05-07T20:32:55.4691796Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4691877Z 2025-05-07T20:32:55.4691973Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4691985Z 2025-05-07T20:32:55.4692088Z moe/activation_test.py:117: 2025-05-07T20:32:55.4692268Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4692382Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4692489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4693014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4693129Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4693505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4693745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4694114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4694215Z kernel = self.compile( 2025-05-07T20:32:55.4694624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4694814Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4694947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4694952Z 2025-05-07T20:32:55.4695178Z self = 2025-05-07T20:32:55.4695985Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4696527Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff0837feb0>} 2025-05-07T20:32:55.4697303Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4697515Z context = 2025-05-07T20:32:55.4697519Z 2025-05-07T20:32:55.4697695Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4697972Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4698096Z module_map=module_map) 2025-05-07T20:32:55.4698268Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4698376Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4698469Z E ^ 2025-05-07T20:32:55.4698925Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4698931Z 2025-05-07T20:32:55.4699370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4699380Z 2025-05-07T20:32:55.4699490Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4699725Z self=, 2025-05-07T20:32:55.4699818Z T=128, 2025-05-07T20:32:55.4699901Z D=5120, 2025-05-07T20:32:55.4699989Z scale_ub=None, 2025-05-07T20:32:55.4700091Z contiguous=False, 2025-05-07T20:32:55.4700182Z compiled=True, 2025-05-07T20:32:55.4700272Z ) 2025-05-07T20:32:55.4700502Z self = 2025-05-07T20:32:55.4700723Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.4700727Z 2025-05-07T20:32:55.4700816Z @given( 2025-05-07T20:32:55.4700948Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4701056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4701184Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4701376Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4701497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4701582Z ) 2025-05-07T20:32:55.4701842Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4701951Z def test_silu_mul_quant( 2025-05-07T20:32:55.4702033Z self, 2025-05-07T20:32:55.4702114Z T: int, 2025-05-07T20:32:55.4702201Z D: int, 2025-05-07T20:32:55.4702306Z scale_ub: Optional[float], 2025-05-07T20:32:55.4702402Z contiguous: bool, 2025-05-07T20:32:55.4702504Z compiled: bool, 2025-05-07T20:32:55.4702588Z ) -> None: 2025-05-07T20:32:55.4702690Z torch.manual_seed(2025) 2025-05-07T20:32:55.4702776Z 2025-05-07T20:32:55.4702959Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4703038Z 2025-05-07T20:32:55.4703141Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4703273Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4703379Z x = x_sign * x_clamp 2025-05-07T20:32:55.4703463Z x0 = x[:, :D] 2025-05-07T20:32:55.4703548Z x1 = x[:, D:] 2025-05-07T20:32:55.4703631Z 2025-05-07T20:32:55.4703720Z if contiguous: 2025-05-07T20:32:55.4703816Z x0 = x0.contiguous() 2025-05-07T20:32:55.4703917Z x1 = x1.contiguous() 2025-05-07T20:32:55.4703994Z 2025-05-07T20:32:55.4704092Z if scale_ub is not None: 2025-05-07T20:32:55.4704213Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4704360Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4704440Z ) 2025-05-07T20:32:55.4704528Z else: 2025-05-07T20:32:55.4704633Z scale_ub_tensor = None 2025-05-07T20:32:55.4704711Z 2025-05-07T20:32:55.4704855Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4704951Z op = silu_mul_quant 2025-05-07T20:32:55.4705050Z if compiled: 2025-05-07T20:32:55.4705156Z op = torch.compile(op) 2025-05-07T20:32:55.4705268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4705350Z 2025-05-07T20:32:55.4705446Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4705451Z 2025-05-07T20:32:55.4705555Z moe/activation_test.py:117: 2025-05-07T20:32:55.4705698Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4705806Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4705912Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4706309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.4706409Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.4707016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4707123Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4707504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4707745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4708100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4708207Z kernel = self.compile( 2025-05-07T20:32:55.4708606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4708833Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4708977Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4708981Z 2025-05-07T20:32:55.4709197Z self = 2025-05-07T20:32:55.4710022Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4710598Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fce8c0>} 2025-05-07T20:32:55.4711373Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4716975Z context = 2025-05-07T20:32:55.4716986Z 2025-05-07T20:32:55.4717197Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4717488Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4717606Z module_map=module_map) 2025-05-07T20:32:55.4717787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4717904Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4717989Z E ^ 2025-05-07T20:32:55.4718371Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4718376Z 2025-05-07T20:32:55.4718822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4718832Z 2025-05-07T20:32:55.4718944Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4719190Z self=, 2025-05-07T20:32:55.4719277Z T=128, 2025-05-07T20:32:55.4719360Z D=7168, 2025-05-07T20:32:55.4719456Z scale_ub=1200.0, 2025-05-07T20:32:55.4719551Z contiguous=False, 2025-05-07T20:32:55.4719641Z compiled=False, 2025-05-07T20:32:55.4719733Z ) 2025-05-07T20:32:55.4719963Z self = 2025-05-07T20:32:55.4720147Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.4720161Z 2025-05-07T20:32:55.4720246Z @given( 2025-05-07T20:32:55.4720376Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4720492Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4720616Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4720742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4720872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4720952Z ) 2025-05-07T20:32:55.4721337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4721450Z def test_silu_mul_quant( 2025-05-07T20:32:55.4721534Z self, 2025-05-07T20:32:55.4721618Z T: int, 2025-05-07T20:32:55.4721705Z D: int, 2025-05-07T20:32:55.4721815Z scale_ub: Optional[float], 2025-05-07T20:32:55.4721916Z contiguous: bool, 2025-05-07T20:32:55.4722010Z compiled: bool, 2025-05-07T20:32:55.4722096Z ) -> None: 2025-05-07T20:32:55.4722203Z torch.manual_seed(2025) 2025-05-07T20:32:55.4722284Z 2025-05-07T20:32:55.4722463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4722550Z 2025-05-07T20:32:55.4722649Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4722782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4722931Z x = x_sign * x_clamp 2025-05-07T20:32:55.4723018Z x0 = x[:, :D] 2025-05-07T20:32:55.4723105Z x1 = x[:, D:] 2025-05-07T20:32:55.4723190Z 2025-05-07T20:32:55.4723285Z if contiguous: 2025-05-07T20:32:55.4723391Z x0 = x0.contiguous() 2025-05-07T20:32:55.4723490Z x1 = x1.contiguous() 2025-05-07T20:32:55.4723568Z 2025-05-07T20:32:55.4723673Z if scale_ub is not None: 2025-05-07T20:32:55.4724114Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4724323Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4724451Z ) 2025-05-07T20:32:55.4724543Z else: 2025-05-07T20:32:55.4724645Z scale_ub_tensor = None 2025-05-07T20:32:55.4724732Z 2025-05-07T20:32:55.4724871Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4724968Z op = silu_mul_quant 2025-05-07T20:32:55.4725073Z if compiled: 2025-05-07T20:32:55.4725185Z op = torch.compile(op) 2025-05-07T20:32:55.4725308Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4725386Z 2025-05-07T20:32:55.4725488Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4725493Z 2025-05-07T20:32:55.4725603Z moe/activation_test.py:117: 2025-05-07T20:32:55.4725744Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4725857Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4725973Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4726495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4726601Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4726986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4727221Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4727590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4727691Z kernel = self.compile( 2025-05-07T20:32:55.4728099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4728291Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4728429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4728434Z 2025-05-07T20:32:55.4728657Z self = 2025-05-07T20:32:55.4729469Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4730003Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08d2aef0>} 2025-05-07T20:32:55.4731057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4731264Z context = 2025-05-07T20:32:55.4731271Z 2025-05-07T20:32:55.4731455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4731733Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4731850Z module_map=module_map) 2025-05-07T20:32:55.4732029Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4732135Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4732227Z E ^ 2025-05-07T20:32:55.4732665Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4732670Z 2025-05-07T20:32:55.4733108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4733113Z 2025-05-07T20:32:55.4733234Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4733535Z self=, 2025-05-07T20:32:55.4733626Z T=128, 2025-05-07T20:32:55.4733712Z D=5120, 2025-05-07T20:32:55.4733801Z scale_ub=None, 2025-05-07T20:32:55.4733905Z contiguous=False, 2025-05-07T20:32:55.4733998Z compiled=False, 2025-05-07T20:32:55.4734077Z ) 2025-05-07T20:32:55.4734314Z self = 2025-05-07T20:32:55.4734498Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.4734506Z 2025-05-07T20:32:55.4734588Z @given( 2025-05-07T20:32:55.4734723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4734831Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4734959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4735090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4735213Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4735305Z ) 2025-05-07T20:32:55.4735565Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4735667Z def test_silu_mul_quant( 2025-05-07T20:32:55.4735755Z self, 2025-05-07T20:32:55.4735838Z T: int, 2025-05-07T20:32:55.4735920Z D: int, 2025-05-07T20:32:55.4736030Z scale_ub: Optional[float], 2025-05-07T20:32:55.4736126Z contiguous: bool, 2025-05-07T20:32:55.4736218Z compiled: bool, 2025-05-07T20:32:55.4736311Z ) -> None: 2025-05-07T20:32:55.4736415Z torch.manual_seed(2025) 2025-05-07T20:32:55.4736492Z 2025-05-07T20:32:55.4736678Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4736760Z 2025-05-07T20:32:55.4736869Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4737003Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4737098Z x = x_sign * x_clamp 2025-05-07T20:32:55.4737191Z x0 = x[:, :D] 2025-05-07T20:32:55.4737281Z x1 = x[:, D:] 2025-05-07T20:32:55.4737360Z 2025-05-07T20:32:55.4737455Z if contiguous: 2025-05-07T20:32:55.4737553Z x0 = x0.contiguous() 2025-05-07T20:32:55.4737649Z x1 = x1.contiguous() 2025-05-07T20:32:55.4737733Z 2025-05-07T20:32:55.4737830Z if scale_ub is not None: 2025-05-07T20:32:55.4737944Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4738093Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4738174Z ) 2025-05-07T20:32:55.4738267Z else: 2025-05-07T20:32:55.4738368Z scale_ub_tensor = None 2025-05-07T20:32:55.4738446Z 2025-05-07T20:32:55.4738677Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4738776Z op = silu_mul_quant 2025-05-07T20:32:55.4738867Z if compiled: 2025-05-07T20:32:55.4738980Z op = torch.compile(op) 2025-05-07T20:32:55.4739098Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4739179Z 2025-05-07T20:32:55.4739284Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4739288Z 2025-05-07T20:32:55.4739394Z moe/activation_test.py:117: 2025-05-07T20:32:55.4739536Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4739644Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4739749Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4740277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4740465Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4740847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4741090Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4741448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4741599Z kernel = self.compile( 2025-05-07T20:32:55.4742000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4742185Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4742324Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4742330Z 2025-05-07T20:32:55.4742544Z self = 2025-05-07T20:32:55.4743370Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4743902Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fccb80>} 2025-05-07T20:32:55.4744682Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4744895Z context = 2025-05-07T20:32:55.4744900Z 2025-05-07T20:32:55.4745074Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4745356Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4745473Z module_map=module_map) 2025-05-07T20:32:55.4745650Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4745763Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4745846Z E ^ 2025-05-07T20:32:55.4746219Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4746233Z 2025-05-07T20:32:55.4746664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4746669Z 2025-05-07T20:32:55.4746780Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4747019Z self=, 2025-05-07T20:32:55.4747102Z T=128, 2025-05-07T20:32:55.4747184Z D=5120, 2025-05-07T20:32:55.4747280Z scale_ub=1200.0, 2025-05-07T20:32:55.4747372Z contiguous=True, 2025-05-07T20:32:55.4747462Z compiled=False, 2025-05-07T20:32:55.4747548Z ) 2025-05-07T20:32:55.4747859Z self = 2025-05-07T20:32:55.4748048Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:55.4748053Z 2025-05-07T20:32:55.4748134Z @given( 2025-05-07T20:32:55.4748261Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4748382Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4748507Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4748632Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4748760Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4748844Z ) 2025-05-07T20:32:55.4749103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4749213Z def test_silu_mul_quant( 2025-05-07T20:32:55.4749296Z self, 2025-05-07T20:32:55.4749429Z T: int, 2025-05-07T20:32:55.4749512Z D: int, 2025-05-07T20:32:55.4749617Z scale_ub: Optional[float], 2025-05-07T20:32:55.4749728Z contiguous: bool, 2025-05-07T20:32:55.4749820Z compiled: bool, 2025-05-07T20:32:55.4749908Z ) -> None: 2025-05-07T20:32:55.4750015Z torch.manual_seed(2025) 2025-05-07T20:32:55.4750093Z 2025-05-07T20:32:55.4750318Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4750403Z 2025-05-07T20:32:55.4750501Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4750635Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4750740Z x = x_sign * x_clamp 2025-05-07T20:32:55.4750826Z x0 = x[:, :D] 2025-05-07T20:32:55.4750922Z x1 = x[:, D:] 2025-05-07T20:32:55.4750999Z 2025-05-07T20:32:55.4751089Z if contiguous: 2025-05-07T20:32:55.4751193Z x0 = x0.contiguous() 2025-05-07T20:32:55.4751292Z x1 = x1.contiguous() 2025-05-07T20:32:55.4751369Z 2025-05-07T20:32:55.4751476Z if scale_ub is not None: 2025-05-07T20:32:55.4751589Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4751738Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4751826Z ) 2025-05-07T20:32:55.4751906Z else: 2025-05-07T20:32:55.4752006Z scale_ub_tensor = None 2025-05-07T20:32:55.4752095Z 2025-05-07T20:32:55.4752232Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4752329Z op = silu_mul_quant 2025-05-07T20:32:55.4752426Z if compiled: 2025-05-07T20:32:55.4752532Z op = torch.compile(op) 2025-05-07T20:32:55.4752652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4752730Z 2025-05-07T20:32:55.4752826Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4752830Z 2025-05-07T20:32:55.4752937Z moe/activation_test.py:117: 2025-05-07T20:32:55.4753077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4753184Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4753301Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4753924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4754035Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4754416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4754651Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4755015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4755116Z kernel = self.compile( 2025-05-07T20:32:55.4755517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4755711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4756017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4756022Z 2025-05-07T20:32:55.4756249Z self = 2025-05-07T20:32:55.4757057Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4757590Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fcff40>} 2025-05-07T20:32:55.4758372Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4758616Z context = 2025-05-07T20:32:55.4758621Z 2025-05-07T20:32:55.4758807Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4759085Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4759249Z module_map=module_map) 2025-05-07T20:32:55.4759422Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4759527Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4759618Z E ^ 2025-05-07T20:32:55.4759989Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4759994Z 2025-05-07T20:32:55.4760425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4760432Z 2025-05-07T20:32:55.4760549Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4760783Z self=, 2025-05-07T20:32:55.4760876Z T=1, 2025-05-07T20:32:55.4760965Z D=7168, 2025-05-07T20:32:55.4761054Z scale_ub=1200.0, 2025-05-07T20:32:55.4761150Z contiguous=True, 2025-05-07T20:32:55.4761240Z compiled=True, 2025-05-07T20:32:55.4761322Z ) 2025-05-07T20:32:55.4761557Z self = 2025-05-07T20:32:55.4761733Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.4761737Z 2025-05-07T20:32:55.4761819Z @given( 2025-05-07T20:32:55.4761951Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4762057Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4762187Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4762311Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4762436Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4762524Z ) 2025-05-07T20:32:55.4762788Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4762888Z def test_silu_mul_quant( 2025-05-07T20:32:55.4762977Z self, 2025-05-07T20:32:55.4763060Z T: int, 2025-05-07T20:32:55.4763142Z D: int, 2025-05-07T20:32:55.4763262Z scale_ub: Optional[float], 2025-05-07T20:32:55.4763358Z contiguous: bool, 2025-05-07T20:32:55.4763450Z compiled: bool, 2025-05-07T20:32:55.4763541Z ) -> None: 2025-05-07T20:32:55.4763642Z torch.manual_seed(2025) 2025-05-07T20:32:55.4763729Z 2025-05-07T20:32:55.4763907Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4763987Z 2025-05-07T20:32:55.4764093Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4764227Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4764324Z x = x_sign * x_clamp 2025-05-07T20:32:55.4764419Z x0 = x[:, :D] 2025-05-07T20:32:55.4764506Z x1 = x[:, D:] 2025-05-07T20:32:55.4764584Z 2025-05-07T20:32:55.4764765Z if contiguous: 2025-05-07T20:32:55.4764864Z x0 = x0.contiguous() 2025-05-07T20:32:55.4764959Z x1 = x1.contiguous() 2025-05-07T20:32:55.4765045Z 2025-05-07T20:32:55.4765149Z if scale_ub is not None: 2025-05-07T20:32:55.4765273Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4765417Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4765501Z ) 2025-05-07T20:32:55.4765591Z else: 2025-05-07T20:32:55.4765692Z scale_ub_tensor = None 2025-05-07T20:32:55.4765771Z 2025-05-07T20:32:55.4765914Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4766014Z op = silu_mul_quant 2025-05-07T20:32:55.4766106Z if compiled: 2025-05-07T20:32:55.4766262Z op = torch.compile(op) 2025-05-07T20:32:55.4766375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4766453Z 2025-05-07T20:32:55.4766564Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4766569Z 2025-05-07T20:32:55.4766672Z moe/activation_test.py:117: 2025-05-07T20:32:55.4766813Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4766964Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4767069Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4767460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.4767564Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    ... (same Triton jit/compile frames as above) ...
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
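The ValueError is raised while Triton lowers the kernel AST (ast_to_ttir): fp8e4nv is Triton's name for the float8_e4m3fn element type, which its NVIDIA backend only lowers on compute capability 8.9 and newer (Ada/Hopper). On this runner's older GPU only fp8e4b15 and fp8e5 are available, exactly as the message says. A minimal sketch of a capability guard, using only standard PyTorch APIs (this helper is illustrative and not part of the FBGEMM test suite):

    import torch

    def supports_fp8e4nv() -> bool:
        """True if the current CUDA device can compile Triton fp8e4nv kernels."""
        if not torch.cuda.is_available():
            return False
        # float8_e4m3fn (Triton fp8e4nv) needs SM 8.9+ (Ada/Hopper).
        return torch.cuda.get_device_capability() >= (8, 9)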
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

The test body is identical to the example above; this time fn() completed, and the failure moved to the eager reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
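For context on what the failing reference path computes: triton_quantize_fp8_row returns a row-wise fp8 tensor plus a per-row scale, which the test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None]. A plain-PyTorch sketch of that contract, under assumptions flagged in the comments (this is not FBGEMM's actual kernel; the FP8_MAX constant and scale_ub handling are one reading of a row-wise scheme):

    from typing import Optional, Tuple
    import torch

    FP8_MAX = 448.0  # assumed: max normal value of float8_e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude, kept away from zero for a stable divide.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        # Assumed: scale_ub caps the per-row max before the scale is derived.
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX  # per-row dequantization factor
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Under this sketch, y_fp8.to(torch.float32) * y_scale[:, None] approximately reconstructs y, which is what the test's comparison relies on.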
Hypothesis continues sampling examples. Every one of the following fails at `y_fp8, y_scale = fn()` (moe/activation_test.py:117) with the identical traceback through silu_mul_quant (routed through torch/_dynamo/eval_frame.py:678 when compiled=True) and the same error:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
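Nothing here is specific to the FBGEMM kernels: any Triton kernel that materializes an fp8e4nv value trips the same check in make_ir. A minimal, hypothetical repro (kernel name, shapes, and launch parameters are illustrative only, assuming a Triton version that exposes tl.float8e4nv):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # This cast is what ast_to_ttir rejects on pre-SM 8.9 GPUs.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda")
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    # Expected on this runner: CompilationError wrapping the same ValueError.
    _cast_to_fp8e4nv[(1,)](x, y, 128, BLOCK=128)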
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4881557Z 2025-05-07T20:32:55.4881996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4882003Z 2025-05-07T20:32:55.4882114Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4882350Z self=, 2025-05-07T20:32:55.4882440Z T=4096, 2025-05-07T20:32:55.4882522Z D=7168, 2025-05-07T20:32:55.4882613Z scale_ub=1200.0, 2025-05-07T20:32:55.4882714Z contiguous=False, 2025-05-07T20:32:55.4882805Z compiled=False, 2025-05-07T20:32:55.4882886Z ) 2025-05-07T20:32:55.4883121Z self = 2025-05-07T20:32:55.4883311Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.4883316Z 2025-05-07T20:32:55.4883405Z @given( 2025-05-07T20:32:55.4883617Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4883728Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4883863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4883989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4884115Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4884201Z ) 2025-05-07T20:32:55.4884461Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4884565Z def test_silu_mul_quant( 2025-05-07T20:32:55.4884655Z self, 2025-05-07T20:32:55.4884738Z T: int, 2025-05-07T20:32:55.4884831Z D: int, 2025-05-07T20:32:55.4884936Z scale_ub: Optional[float], 2025-05-07T20:32:55.4885032Z contiguous: bool, 2025-05-07T20:32:55.4885174Z compiled: bool, 2025-05-07T20:32:55.4885260Z ) -> None: 2025-05-07T20:32:55.4885361Z torch.manual_seed(2025) 2025-05-07T20:32:55.4885448Z 2025-05-07T20:32:55.4885631Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4885711Z 2025-05-07T20:32:55.4885816Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4885953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4886090Z x = x_sign * x_clamp 2025-05-07T20:32:55.4886184Z x0 = x[:, :D] 2025-05-07T20:32:55.4886270Z x1 = x[:, D:] 2025-05-07T20:32:55.4886349Z 2025-05-07T20:32:55.4886446Z if contiguous: 2025-05-07T20:32:55.4886543Z x0 = x0.contiguous() 2025-05-07T20:32:55.4886647Z x1 = x1.contiguous() 2025-05-07T20:32:55.4886727Z 2025-05-07T20:32:55.4886824Z if scale_ub is not None: 2025-05-07T20:32:55.4886943Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4887090Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4887172Z ) 2025-05-07T20:32:55.4887260Z else: 2025-05-07T20:32:55.4887366Z scale_ub_tensor = None 2025-05-07T20:32:55.4887446Z 2025-05-07T20:32:55.4887589Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4887686Z op = silu_mul_quant 2025-05-07T20:32:55.4887777Z if compiled: 2025-05-07T20:32:55.4887892Z op = torch.compile(op) 2025-05-07T20:32:55.4888005Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4888089Z 2025-05-07T20:32:55.4888184Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4888189Z 2025-05-07T20:32:55.4888292Z moe/activation_test.py:117: 2025-05-07T20:32:55.4888433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4888541Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4888648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4889179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:55.4889286Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4889669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4889906Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4890266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4890373Z kernel = self.compile( 2025-05-07T20:32:55.4890776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4890962Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4891102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4891110Z 2025-05-07T20:32:55.4891327Z self = 2025-05-07T20:32:55.4892227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4892766Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7a08ca0>} 2025-05-07T20:32:55.4893548Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4893752Z context = 2025-05-07T20:32:55.4893757Z 2025-05-07T20:32:55.4893933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4894258Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4894378Z module_map=module_map) 2025-05-07T20:32:55.4894557Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4894663Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4894885Z E ^ 2025-05-07T20:32:55.4895266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4895271Z 2025-05-07T20:32:55.4895704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4895708Z 2025-05-07T20:32:55.4895818Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4896061Z self=, 2025-05-07T20:32:55.4896149Z T=16384, 2025-05-07T20:32:55.4896240Z D=7168, 2025-05-07T20:32:55.4896328Z scale_ub=None, 2025-05-07T20:32:55.4896419Z contiguous=True, 2025-05-07T20:32:55.4896513Z compiled=True, 2025-05-07T20:32:55.4896595Z ) 2025-05-07T20:32:55.4896825Z self = 2025-05-07T20:32:55.4897014Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:55.4897022Z 2025-05-07T20:32:55.4897104Z @given( 2025-05-07T20:32:55.4897232Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4897347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4897470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4897603Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4897725Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4897806Z ) 2025-05-07T20:32:55.4898073Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4898176Z def test_silu_mul_quant( 2025-05-07T20:32:55.4898259Z self, 2025-05-07T20:32:55.4898350Z T: int, 2025-05-07T20:32:55.4898436Z D: int, 2025-05-07T20:32:55.4898541Z scale_ub: Optional[float], 2025-05-07T20:32:55.4898645Z contiguous: bool, 2025-05-07T20:32:55.4898741Z compiled: bool, 2025-05-07T20:32:55.4898825Z ) -> None: 2025-05-07T20:32:55.4898938Z torch.manual_seed(2025) 2025-05-07T20:32:55.4899016Z 2025-05-07T20:32:55.4899199Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4899280Z 2025-05-07T20:32:55.4899378Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4899522Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4899618Z x = x_sign * x_clamp 2025-05-07T20:32:55.4899704Z x0 = x[:, :D] 2025-05-07T20:32:55.4899798Z x1 = x[:, D:] 2025-05-07T20:32:55.4899877Z 2025-05-07T20:32:55.4899970Z if contiguous: 2025-05-07T20:32:55.4900075Z x0 = x0.contiguous() 2025-05-07T20:32:55.4900172Z x1 = x1.contiguous() 2025-05-07T20:32:55.4900250Z 2025-05-07T20:32:55.4900439Z if scale_ub is not None: 2025-05-07T20:32:55.4900554Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4900704Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4900788Z ) 2025-05-07T20:32:55.4900871Z else: 2025-05-07T20:32:55.4900977Z scale_ub_tensor = None 2025-05-07T20:32:55.4901055Z 2025-05-07T20:32:55.4901193Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4901297Z op = silu_mul_quant 2025-05-07T20:32:55.4901389Z if compiled: 2025-05-07T20:32:55.4901495Z op = torch.compile(op) 2025-05-07T20:32:55.4901615Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4901694Z 2025-05-07T20:32:55.4901790Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4901838Z 2025-05-07T20:32:55.4901949Z moe/activation_test.py:117: 2025-05-07T20:32:55.4902085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4902210Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4902316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4902702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.4902853Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.4903369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4903473Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4903854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4904090Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4904456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4904556Z kernel = self.compile( 2025-05-07T20:32:55.4904962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4905155Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4905291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4905296Z 2025-05-07T20:32:55.4905518Z self = 2025-05-07T20:32:55.4906326Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4906857Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7a09b40>} 2025-05-07T20:32:55.4907644Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4907849Z context = 2025-05-07T20:32:55.4907856Z 2025-05-07T20:32:55.4908037Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4908317Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4908433Z module_map=module_map) 2025-05-07T20:32:55.4908612Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4908720Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4908808Z E ^ 2025-05-07T20:32:55.4909188Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4909193Z 2025-05-07T20:32:55.4909741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4909746Z 2025-05-07T20:32:55.4909865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4910103Z self=, 2025-05-07T20:32:55.4910186Z T=4096, 2025-05-07T20:32:55.4910273Z D=5120, 2025-05-07T20:32:55.4910364Z scale_ub=None, 2025-05-07T20:32:55.4910461Z contiguous=False, 2025-05-07T20:32:55.4910549Z compiled=True, 2025-05-07T20:32:55.4910627Z ) 2025-05-07T20:32:55.4910860Z self = 2025-05-07T20:32:55.4911044Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.4911048Z 2025-05-07T20:32:55.4911172Z @given( 2025-05-07T20:32:55.4911306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4911413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4911540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4911675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4911798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4911924Z ) 2025-05-07T20:32:55.4912189Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4912290Z def test_silu_mul_quant( 2025-05-07T20:32:55.4912378Z self, 2025-05-07T20:32:55.4912462Z T: int, 2025-05-07T20:32:55.4912544Z D: int, 2025-05-07T20:32:55.4912655Z scale_ub: Optional[float], 2025-05-07T20:32:55.4912751Z contiguous: bool, 2025-05-07T20:32:55.4912842Z compiled: bool, 2025-05-07T20:32:55.4912931Z ) -> None: 2025-05-07T20:32:55.4913034Z torch.manual_seed(2025) 2025-05-07T20:32:55.4913117Z 2025-05-07T20:32:55.4913304Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4913384Z 2025-05-07T20:32:55.4913495Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4913753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4913849Z x = x_sign * x_clamp 2025-05-07T20:32:55.4913941Z x0 = x[:, :D] 2025-05-07T20:32:55.4914046Z x1 = x[:, D:] 2025-05-07T20:32:55.4914123Z 2025-05-07T20:32:55.4914219Z if contiguous: 2025-05-07T20:32:55.4914317Z x0 = x0.contiguous() 2025-05-07T20:32:55.4914412Z x1 = x1.contiguous() 2025-05-07T20:32:55.4914499Z 2025-05-07T20:32:55.4914597Z if scale_ub is not None: 2025-05-07T20:32:55.4914710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4914860Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4914941Z ) 2025-05-07T20:32:55.4915034Z else: 2025-05-07T20:32:55.4915135Z scale_ub_tensor = None 2025-05-07T20:32:55.4915214Z 2025-05-07T20:32:55.4915358Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4915461Z op = silu_mul_quant 2025-05-07T20:32:55.4915555Z if compiled: 2025-05-07T20:32:55.4915667Z op = torch.compile(op) 2025-05-07T20:32:55.4915779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4915861Z 2025-05-07T20:32:55.4915964Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4915969Z 2025-05-07T20:32:55.4916073Z moe/activation_test.py:117: 2025-05-07T20:32:55.4916215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4916323Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4916428Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4916824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.4916926Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.4917540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7efdf7a09240>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>, options = CUDAOptions(...), codegen_fns = {...}, module_map = {...}, context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
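Every failure in this stretch of the log is the same root cause surfacing through different Hypothesis examples: silu_mul_quant launches the _fbgemm_silu_mul_quant Triton kernel, which produces an fp8e4nv (e4m3) output, and the GPU on this runner has no lowering for that dtype, so compilation is rejected in ast_to_ttir before the kernel ever runs; only fp8e4b15 and fp8e5 are available here. A minimal sketch of the failing call path with the Hypothesis harness stripped away (assumes the fbgemm_gpu gen_ai package built by this job is importable and a CUDA device is visible; the shapes are illustrative):

    # Repro sketch, not part of the test suite: drives the same
    # silu_mul_quant -> _fbgemm_silu_mul_quant path that fails above.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120  # any of the sampled sizes reproduces the error
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    # On a GPU without fp8e4nv support this raises
    # triton.compiler.errors.CompilationError at kernel-compile time.
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)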
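To keep a job like this green on such hardware, one option is to gate the FP8 tests on compute capability up front rather than letting every example die inside the Triton compiler. A minimal sketch, assuming fp8e4nv requires an SM 8.9 or newer GPU; the helper and decorator names are illustrative, not existing FBGEMM test utilities:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Hypothetical guard: Triton only lowers fp8e4nv on SM 8.9+ parts.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests like test_silu_mul_quant above.
    skip_unless_fp8e4nv = unittest.skipUnless(
        gpu_supports_fp8e4nv(),
        "Triton on this GPU supports only ('fp8e4b15', 'fp8e5')",
    )

Applied as @skip_unless_fp8e4nv, the examples below would be reported as skips instead of repeating the identical CompilationError for every parameter combination.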
2025-05-07T20:32:55.4945731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4945883Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4946267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4946505Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4946914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4947015Z kernel = self.compile( 2025-05-07T20:32:55.4947418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4947611Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4947744Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4947749Z 2025-05-07T20:32:55.4947976Z self = 2025-05-07T20:32:55.4948789Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4949329Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7a0ab90>} 2025-05-07T20:32:55.4950113Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4950316Z context = 2025-05-07T20:32:55.4950321Z 2025-05-07T20:32:55.4950504Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4950785Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4950910Z module_map=module_map) 2025-05-07T20:32:55.4951085Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4951192Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4951285Z E ^ 2025-05-07T20:32:55.4951661Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4951668Z 2025-05-07T20:32:55.4952102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4952112Z 2025-05-07T20:32:55.4952222Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4952456Z self=, 2025-05-07T20:32:55.4952546Z T=2048, 2025-05-07T20:32:55.4952628Z D=7168, 2025-05-07T20:32:55.4952721Z scale_ub=1200.0, 2025-05-07T20:32:55.4952822Z contiguous=False, 2025-05-07T20:32:55.4952913Z compiled=False, 2025-05-07T20:32:55.4952991Z ) 2025-05-07T20:32:55.4953313Z self = 2025-05-07T20:32:55.4953619Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.4953624Z 2025-05-07T20:32:55.4953719Z @given( 2025-05-07T20:32:55.4953845Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4953951Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4954080Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4954204Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4954326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4954411Z ) 2025-05-07T20:32:55.4954670Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4954827Z def test_silu_mul_quant( 2025-05-07T20:32:55.4954917Z self, 2025-05-07T20:32:55.4955001Z T: int, 2025-05-07T20:32:55.4955082Z D: int, 2025-05-07T20:32:55.4955200Z scale_ub: Optional[float], 2025-05-07T20:32:55.4955295Z contiguous: bool, 2025-05-07T20:32:55.4955392Z compiled: bool, 2025-05-07T20:32:55.4955477Z ) -> None: 2025-05-07T20:32:55.4955577Z torch.manual_seed(2025) 2025-05-07T20:32:55.4955708Z 2025-05-07T20:32:55.4955887Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4955966Z 2025-05-07T20:32:55.4956069Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4956202Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4956297Z x = x_sign * x_clamp 2025-05-07T20:32:55.4956389Z x0 = x[:, :D] 2025-05-07T20:32:55.4956475Z x1 = x[:, D:] 2025-05-07T20:32:55.4956553Z 2025-05-07T20:32:55.4956647Z if contiguous: 2025-05-07T20:32:55.4956749Z x0 = x0.contiguous() 2025-05-07T20:32:55.4956851Z x1 = x1.contiguous() 2025-05-07T20:32:55.4956931Z 2025-05-07T20:32:55.4957029Z if scale_ub is not None: 2025-05-07T20:32:55.4957153Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4957298Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4957378Z ) 2025-05-07T20:32:55.4957470Z else: 2025-05-07T20:32:55.4957570Z scale_ub_tensor = None 2025-05-07T20:32:55.4957649Z 2025-05-07T20:32:55.4957792Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4957890Z op = silu_mul_quant 2025-05-07T20:32:55.4957984Z if compiled: 2025-05-07T20:32:55.4958096Z op = torch.compile(op) 2025-05-07T20:32:55.4958210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4958294Z 2025-05-07T20:32:55.4958391Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4958395Z 2025-05-07T20:32:55.4958502Z moe/activation_test.py:117: 2025-05-07T20:32:55.4958646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4958758Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4958865Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4959399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:55.4959507Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4959886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4960128Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4960487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4960595Z kernel = self.compile( 2025-05-07T20:32:55.4960998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4961187Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4961414Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4961419Z 2025-05-07T20:32:55.4961638Z self = 2025-05-07T20:32:55.4962460Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4962991Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf75285e0>} 2025-05-07T20:32:55.4963777Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4964029Z context = 2025-05-07T20:32:55.4964033Z 2025-05-07T20:32:55.4964210Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4964497Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4964659Z module_map=module_map) 2025-05-07T20:32:55.4964833Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4964948Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4965031Z E ^ 2025-05-07T20:32:55.4965412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4965416Z 2025-05-07T20:32:55.4965852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4965861Z 2025-05-07T20:32:55.4965972Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4966218Z self=, 2025-05-07T20:32:55.4966302Z T=1, 2025-05-07T20:32:55.4966391Z D=7168, 2025-05-07T20:32:55.4966479Z scale_ub=None, 2025-05-07T20:32:55.4966571Z contiguous=True, 2025-05-07T20:32:55.4966673Z compiled=False, 2025-05-07T20:32:55.4966752Z ) 2025-05-07T20:32:55.4966982Z self = 2025-05-07T20:32:55.4967164Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:55.4967168Z 2025-05-07T20:32:55.4967251Z @given( 2025-05-07T20:32:55.4967379Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4967493Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4967617Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4967752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4967875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4967959Z ) 2025-05-07T20:32:55.4968230Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4968332Z def test_silu_mul_quant( 2025-05-07T20:32:55.4968415Z self, 2025-05-07T20:32:55.4968507Z T: int, 2025-05-07T20:32:55.4968590Z D: int, 2025-05-07T20:32:55.4968694Z scale_ub: Optional[float], 2025-05-07T20:32:55.4968797Z contiguous: bool, 2025-05-07T20:32:55.4968890Z compiled: bool, 2025-05-07T20:32:55.4968973Z ) -> None: 2025-05-07T20:32:55.4969080Z torch.manual_seed(2025) 2025-05-07T20:32:55.4969162Z 2025-05-07T20:32:55.4969339Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4969424Z 2025-05-07T20:32:55.4969523Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4969665Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4969760Z x = x_sign * x_clamp 2025-05-07T20:32:55.4969845Z x0 = x[:, :D] 2025-05-07T20:32:55.4970021Z x1 = x[:, D:] 2025-05-07T20:32:55.4970101Z 2025-05-07T20:32:55.4970190Z if contiguous: 2025-05-07T20:32:55.4970291Z x0 = x0.contiguous() 2025-05-07T20:32:55.4970386Z x1 = x1.contiguous() 2025-05-07T20:32:55.4970466Z 2025-05-07T20:32:55.4970569Z if scale_ub is not None: 2025-05-07T20:32:55.4970685Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4970829Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4970915Z ) 2025-05-07T20:32:55.4970998Z else: 2025-05-07T20:32:55.4971103Z scale_ub_tensor = None 2025-05-07T20:32:55.4971181Z 2025-05-07T20:32:55.4971318Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4971421Z op = silu_mul_quant 2025-05-07T20:32:55.4971555Z if compiled: 2025-05-07T20:32:55.4971661Z op = torch.compile(op) 2025-05-07T20:32:55.4971782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4971867Z 2025-05-07T20:32:55.4971963Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4971968Z 2025-05-07T20:32:55.4972079Z moe/activation_test.py:117: 2025-05-07T20:32:55.4972217Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4972410Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4972517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4973043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4973154Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4973530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4973771Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4974140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4974240Z kernel = self.compile( 2025-05-07T20:32:55.4974650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4974842Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4974976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4974981Z 2025-05-07T20:32:55.4975204Z self = 2025-05-07T20:32:55.4976019Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4976563Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7528d30>} 2025-05-07T20:32:55.4977342Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4977548Z context = 2025-05-07T20:32:55.4977559Z 2025-05-07T20:32:55.4977734Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4978011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4978131Z module_map=module_map) 2025-05-07T20:32:55.4978302Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4978409Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4978502Z E ^ 2025-05-07T20:32:55.4978961Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4978967Z 2025-05-07T20:32:55.4979408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4979413Z 2025-05-07T20:32:55.4979530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4979764Z self=, 2025-05-07T20:32:55.4979873Z T=16384, 2025-05-07T20:32:55.4979958Z D=7168, 2025-05-07T20:32:55.4980050Z scale_ub=1200.0, 2025-05-07T20:32:55.4980149Z contiguous=False, 2025-05-07T20:32:55.4985558Z compiled=True, 2025-05-07T20:32:55.4985652Z ) 2025-05-07T20:32:55.4985900Z self = 2025-05-07T20:32:55.4986096Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:55.4986177Z 2025-05-07T20:32:55.4986271Z @given( 2025-05-07T20:32:55.4986400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4986514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4986646Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4986771Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4986942Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4987030Z ) 2025-05-07T20:32:55.4987294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4987396Z def test_silu_mul_quant( 2025-05-07T20:32:55.4987488Z self, 2025-05-07T20:32:55.4987571Z T: int, 2025-05-07T20:32:55.4987654Z D: int, 2025-05-07T20:32:55.4987768Z scale_ub: Optional[float], 2025-05-07T20:32:55.4987864Z contiguous: bool, 2025-05-07T20:32:55.4987962Z compiled: bool, 2025-05-07T20:32:55.4988050Z ) -> None: 2025-05-07T20:32:55.4988152Z torch.manual_seed(2025) 2025-05-07T20:32:55.4988238Z 2025-05-07T20:32:55.4988423Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4988504Z 2025-05-07T20:32:55.4988610Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4988744Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4988842Z x = x_sign * x_clamp 2025-05-07T20:32:55.4988934Z x0 = x[:, :D] 2025-05-07T20:32:55.4989019Z x1 = x[:, D:] 2025-05-07T20:32:55.4989097Z 2025-05-07T20:32:55.4989192Z if contiguous: 2025-05-07T20:32:55.4989289Z x0 = x0.contiguous() 2025-05-07T20:32:55.4989389Z x1 = x1.contiguous() 2025-05-07T20:32:55.4989468Z 2025-05-07T20:32:55.4989564Z if scale_ub is not None: 2025-05-07T20:32:55.4989687Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4989833Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4989915Z ) 2025-05-07T20:32:55.4990002Z else: 2025-05-07T20:32:55.4990104Z scale_ub_tensor = None 2025-05-07T20:32:55.4990182Z 2025-05-07T20:32:55.4990331Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4990429Z op = silu_mul_quant 2025-05-07T20:32:55.4990519Z if compiled: 2025-05-07T20:32:55.4990635Z op = torch.compile(op) 2025-05-07T20:32:55.4990747Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4990832Z 2025-05-07T20:32:55.4990929Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4990934Z 2025-05-07T20:32:55.4991038Z moe/activation_test.py:117: 2025-05-07T20:32:55.4991182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4991290Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4991397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4991802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.4991905Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.4992518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4992633Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4993012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4993262Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4993805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4993906Z kernel = self.compile( 2025-05-07T20:32:55.4994315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4994502Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4994690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4994696Z 2025-05-07T20:32:55.4994921Z self = 2025-05-07T20:32:55.4995735Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4996319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7529bd0>} 2025-05-07T20:32:55.4997102Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4997316Z context = 2025-05-07T20:32:55.4997320Z 2025-05-07T20:32:55.4997500Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4997779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4997901Z module_map=module_map) 2025-05-07T20:32:55.4998078Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4998191Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4998274Z E ^ 2025-05-07T20:32:55.4998648Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4998654Z 2025-05-07T20:32:55.4999091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4999096Z 2025-05-07T20:32:55.4999206Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4999452Z self=, 2025-05-07T20:32:55.4999534Z T=1, 2025-05-07T20:32:55.4999617Z D=7168, 2025-05-07T20:32:55.4999715Z scale_ub=None, 2025-05-07T20:32:55.4999809Z contiguous=False, 2025-05-07T20:32:55.4999899Z compiled=False, 2025-05-07T20:32:55.4999987Z ) 2025-05-07T20:32:55.5000217Z self = 2025-05-07T20:32:55.5000397Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.5000402Z 2025-05-07T20:32:55.5000491Z @given( 2025-05-07T20:32:55.5000619Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5000734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5000859Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5000985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5001114Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5001198Z ) 2025-05-07T20:32:55.5001458Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5001653Z def test_silu_mul_quant( 2025-05-07T20:32:55.5001740Z self, 2025-05-07T20:32:55.5001823Z T: int, 2025-05-07T20:32:55.5001911Z D: int, 2025-05-07T20:32:55.5002018Z scale_ub: Optional[float], 2025-05-07T20:32:55.5002117Z contiguous: bool, 2025-05-07T20:32:55.5002219Z compiled: bool, 2025-05-07T20:32:55.5002304Z ) -> None: 2025-05-07T20:32:55.5002412Z torch.manual_seed(2025) 2025-05-07T20:32:55.5002490Z 2025-05-07T20:32:55.5002671Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5002758Z 2025-05-07T20:32:55.5002857Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5002992Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5003094Z x = x_sign * x_clamp 2025-05-07T20:32:55.5003224Z x0 = x[:, :D] 2025-05-07T20:32:55.5003310Z x1 = x[:, D:] 2025-05-07T20:32:55.5003397Z 2025-05-07T20:32:55.5003487Z if contiguous: 2025-05-07T20:32:55.5003593Z x0 = x0.contiguous() 2025-05-07T20:32:55.5003694Z x1 = x1.contiguous() 2025-05-07T20:32:55.5003772Z 2025-05-07T20:32:55.5003870Z if scale_ub is not None: 2025-05-07T20:32:55.5003990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5004177Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5004264Z ) 2025-05-07T20:32:55.5004346Z else: 2025-05-07T20:32:55.5004445Z scale_ub_tensor = None 2025-05-07T20:32:55.5004530Z 2025-05-07T20:32:55.5004666Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5004761Z op = silu_mul_quant 2025-05-07T20:32:55.5004861Z if compiled: 2025-05-07T20:32:55.5004967Z op = torch.compile(op) 2025-05-07T20:32:55.5005083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5005167Z 2025-05-07T20:32:55.5005265Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5005270Z 2025-05-07T20:32:55.5005385Z moe/activation_test.py:117: 2025-05-07T20:32:55.5005522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5005631Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5005745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5006266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5006370Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5006752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5006987Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5007352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5007456Z kernel = self.compile( 2025-05-07T20:32:55.5007861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5008054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5008188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5008196Z 2025-05-07T20:32:55.5008412Z self = 2025-05-07T20:32:55.5009226Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5009757Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf752a050>} 2025-05-07T20:32:55.5010630Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5010839Z context = 2025-05-07T20:32:55.5010848Z 2025-05-07T20:32:55.5011029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5011310Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5011426Z module_map=module_map) 2025-05-07T20:32:55.5011603Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5011708Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5011791Z E ^ 2025-05-07T20:32:55.5012170Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5012251Z 2025-05-07T20:32:55.5012689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5012694Z 2025-05-07T20:32:55.5012813Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5013048Z self=, 2025-05-07T20:32:55.5013171Z T=2048, 2025-05-07T20:32:55.5013259Z D=7168, 2025-05-07T20:32:55.5013347Z scale_ub=None, 2025-05-07T20:32:55.5013439Z contiguous=False, 2025-05-07T20:32:55.5013536Z compiled=True, 2025-05-07T20:32:55.5013613Z ) 2025-05-07T20:32:55.5013848Z self = 2025-05-07T20:32:55.5014031Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.5014036Z 2025-05-07T20:32:55.5014118Z @given( 2025-05-07T20:32:55.5014256Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5014363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5014491Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5014623Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5014743Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5014833Z ) 2025-05-07T20:32:55.5015095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5015196Z def test_silu_mul_quant( 2025-05-07T20:32:55.5015285Z self, 2025-05-07T20:32:55.5015369Z T: int, 2025-05-07T20:32:55.5015455Z D: int, 2025-05-07T20:32:55.5015566Z scale_ub: Optional[float], 2025-05-07T20:32:55.5015663Z contiguous: bool, 2025-05-07T20:32:55.5015755Z compiled: bool, 2025-05-07T20:32:55.5015845Z ) -> None: 2025-05-07T20:32:55.5015951Z torch.manual_seed(2025) 2025-05-07T20:32:55.5016033Z 2025-05-07T20:32:55.5016217Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5016298Z 2025-05-07T20:32:55.5016397Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5016540Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5016638Z x = x_sign * x_clamp 2025-05-07T20:32:55.5016730Z x0 = x[:, :D] 2025-05-07T20:32:55.5016814Z x1 = x[:, D:] 2025-05-07T20:32:55.5016898Z 2025-05-07T20:32:55.5016998Z if contiguous: 2025-05-07T20:32:55.5017096Z x0 = x0.contiguous() 2025-05-07T20:32:55.5017193Z x1 = x1.contiguous() 2025-05-07T20:32:55.5017278Z 2025-05-07T20:32:55.5017377Z if scale_ub is not None: 2025-05-07T20:32:55.5017495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5017646Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5017729Z ) 2025-05-07T20:32:55.5017811Z else: 2025-05-07T20:32:55.5017923Z scale_ub_tensor = None 2025-05-07T20:32:55.5018002Z 2025-05-07T20:32:55.5018147Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5018345Z op = silu_mul_quant 2025-05-07T20:32:55.5018439Z if compiled: 2025-05-07T20:32:55.5018553Z op = torch.compile(op) 2025-05-07T20:32:55.5018666Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5018748Z 2025-05-07T20:32:55.5018850Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5018855Z 2025-05-07T20:32:55.5018959Z moe/activation_test.py:117: 2025-05-07T20:32:55.5019096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5019208Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5019314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5019710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.5019810Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.5020370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5020485Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5020861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5021097Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5021511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5021613Z kernel = self.compile( 2025-05-07T20:32:55.5022020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5022207Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5022344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5022350Z 2025-05-07T20:32:55.5022573Z self = 2025-05-07T20:32:55.5023390Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5024261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf752b1c0>} 2025-05-07T20:32:55.5025118Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5025323Z context = 2025-05-07T20:32:55.5025333Z 2025-05-07T20:32:55.5025515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5025791Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5025917Z module_map=module_map) 2025-05-07T20:32:55.5026089Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5026196Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5026288Z E ^ 2025-05-07T20:32:55.5026661Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5026666Z 2025-05-07T20:32:55.5027104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5027109Z 2025-05-07T20:32:55.5027222Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5027455Z self=, 2025-05-07T20:32:55.5027546Z T=4096, 2025-05-07T20:32:55.5027629Z D=7168, 2025-05-07T20:32:55.5027716Z scale_ub=None, 2025-05-07T20:32:55.5027814Z contiguous=False, 2025-05-07T20:32:55.5028149Z compiled=True, 2025-05-07T20:32:55.5028231Z ) 2025-05-07T20:32:55.5028468Z self = 2025-05-07T20:32:55.5028650Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.5028657Z 2025-05-07T20:32:55.5028743Z @given( 2025-05-07T20:32:55.5028870Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5028977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5029104Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5029230Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5029351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5029438Z ) 2025-05-07T20:32:55.5029698Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5029872Z def test_silu_mul_quant( 2025-05-07T20:32:55.5029955Z self, 2025-05-07T20:32:55.5030038Z T: int, 2025-05-07T20:32:55.5030132Z D: int, 2025-05-07T20:32:55.5030237Z scale_ub: Optional[float], 2025-05-07T20:32:55.5030333Z contiguous: bool, 2025-05-07T20:32:55.5030431Z compiled: bool, 2025-05-07T20:32:55.5030597Z ) -> None: 2025-05-07T20:32:55.5030699Z torch.manual_seed(2025) 2025-05-07T20:32:55.5030783Z 2025-05-07T20:32:55.5030961Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5031041Z 2025-05-07T20:32:55.5031151Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5031284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5031380Z x = x_sign * x_clamp 2025-05-07T20:32:55.5031474Z x0 = x[:, :D] 2025-05-07T20:32:55.5031560Z x1 = x[:, D:] 2025-05-07T20:32:55.5031645Z 2025-05-07T20:32:55.5031739Z if contiguous: 2025-05-07T20:32:55.5031837Z x0 = x0.contiguous() 2025-05-07T20:32:55.5031940Z x1 = x1.contiguous() 2025-05-07T20:32:55.5032019Z 2025-05-07T20:32:55.5032119Z if scale_ub is not None: 2025-05-07T20:32:55.5032240Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5032383Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5032467Z ) 2025-05-07T20:32:55.5032557Z else: 2025-05-07T20:32:55.5032657Z scale_ub_tensor = None 2025-05-07T20:32:55.5032738Z 2025-05-07T20:32:55.5032883Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5032981Z op = silu_mul_quant 2025-05-07T20:32:55.5033082Z if compiled: 2025-05-07T20:32:55.5033189Z op = torch.compile(op) 2025-05-07T20:32:55.5033303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5033388Z 2025-05-07T20:32:55.5033489Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5033493Z 2025-05-07T20:32:55.5033693Z moe/activation_test.py:117: 2025-05-07T20:32:55.5033842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5033950Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5034055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5034447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.5034550Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.5035074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5035177Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5035556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5035799Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5036158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5036352Z kernel = self.compile( 2025-05-07T20:32:55.5036755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5036941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5037082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5037087Z 2025-05-07T20:32:55.5037302Z self = 2025-05-07T20:32:55.5038111Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5038648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71a81f0>} 2025-05-07T20:32:55.5039525Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5039734Z context = 2025-05-07T20:32:55.5039777Z 2025-05-07T20:32:55.5039953Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5040236Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5040351Z module_map=module_map) 2025-05-07T20:32:55.5040523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5040636Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5040717Z E ^ 2025-05-07T20:32:55.5041093Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5041098Z 2025-05-07T20:32:55.5041542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5041546Z 2025-05-07T20:32:55.5041656Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5041898Z self=, 2025-05-07T20:32:55.5041980Z T=16384, 2025-05-07T20:32:55.5042062Z D=5120, 2025-05-07T20:32:55.5042156Z scale_ub=1200.0, 2025-05-07T20:32:55.5042249Z contiguous=False, 2025-05-07T20:32:55.5042338Z compiled=False, 2025-05-07T20:32:55.5042424Z ) 2025-05-07T20:32:55.5042653Z self = 2025-05-07T20:32:55.5042845Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.5042858Z 2025-05-07T20:32:55.5042940Z @given( 2025-05-07T20:32:55.5043070Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5043181Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5043308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5043434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5043561Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5043643Z ) 2025-05-07T20:32:55.5043905Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5044013Z def test_silu_mul_quant( 2025-05-07T20:32:55.5044094Z self, 2025-05-07T20:32:55.5044181Z T: int, 2025-05-07T20:32:55.5044262Z D: int, 2025-05-07T20:32:55.5044365Z scale_ub: Optional[float], 2025-05-07T20:32:55.5044464Z contiguous: bool, 2025-05-07T20:32:55.5044555Z compiled: bool, 2025-05-07T20:32:55.5044638Z ) -> None: 2025-05-07T20:32:55.5044750Z torch.manual_seed(2025) 2025-05-07T20:32:55.5044829Z 2025-05-07T20:32:55.5045006Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5045092Z 2025-05-07T20:32:55.5045308Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5045442Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5045543Z x = x_sign * x_clamp 2025-05-07T20:32:55.5045631Z x0 = x[:, :D] 2025-05-07T20:32:55.5045716Z x1 = x[:, D:] 2025-05-07T20:32:55.5045800Z 2025-05-07T20:32:55.5045889Z if contiguous: 2025-05-07T20:32:55.5045994Z x0 = x0.contiguous() 2025-05-07T20:32:55.5046089Z x1 = x1.contiguous() 2025-05-07T20:32:55.5046166Z 2025-05-07T20:32:55.5046270Z if scale_ub is not None: 2025-05-07T20:32:55.5046383Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5046528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5046617Z ) 2025-05-07T20:32:55.5046743Z else: 2025-05-07T20:32:55.5046845Z scale_ub_tensor = None 2025-05-07T20:32:55.5046936Z 2025-05-07T20:32:55.5047082Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5047178Z op = silu_mul_quant 2025-05-07T20:32:55.5047286Z if compiled: 2025-05-07T20:32:55.5047395Z op = torch.compile(op) 2025-05-07T20:32:55.5047557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5047635Z 2025-05-07T20:32:55.5047733Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5047737Z 2025-05-07T20:32:55.5047849Z moe/activation_test.py:117: 2025-05-07T20:32:55.5047986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5048094Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5048206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5048728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:55.5048841Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5049226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5049461Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5049828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5049933Z kernel = self.compile( 2025-05-07T20:32:55.5050337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5050529Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5050664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5050668Z 2025-05-07T20:32:55.5050891Z self = 2025-05-07T20:32:55.5051707Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5052243Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71a8700>} 2025-05-07T20:32:55.5053026Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5053229Z context = 2025-05-07T20:32:55.5053235Z 2025-05-07T20:32:55.5053418Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5053698Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5053822Z module_map=module_map) 2025-05-07T20:32:55.5054074Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5054181Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5054270Z E ^ 2025-05-07T20:32:55.5054644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5054652Z 2025-05-07T20:32:55.5055086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5055097Z 2025-05-07T20:32:55.5055209Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5055444Z self=, 2025-05-07T20:32:55.5055535Z T=16384, 2025-05-07T20:32:55.5055618Z D=5120, 2025-05-07T20:32:55.5055708Z scale_ub=1200.0, 2025-05-07T20:32:55.5055846Z contiguous=True, 2025-05-07T20:32:55.5055935Z compiled=True, 2025-05-07T20:32:55.5056014Z ) 2025-05-07T20:32:55.5056260Z self = 2025-05-07T20:32:55.5056444Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.5056449Z 2025-05-07T20:32:55.5056531Z @given( 2025-05-07T20:32:55.5056662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5056821Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5056967Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5057115Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5057238Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5057324Z ) 2025-05-07T20:32:55.5057584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5057685Z def test_silu_mul_quant( 2025-05-07T20:32:55.5057777Z self, 2025-05-07T20:32:55.5057859Z T: int, 2025-05-07T20:32:55.5057942Z D: int, 2025-05-07T20:32:55.5058054Z scale_ub: Optional[float], 2025-05-07T20:32:55.5058156Z contiguous: bool, 2025-05-07T20:32:55.5058253Z compiled: bool, 2025-05-07T20:32:55.5058337Z ) -> None: 2025-05-07T20:32:55.5058437Z torch.manual_seed(2025) 2025-05-07T20:32:55.5058523Z 2025-05-07T20:32:55.5058704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5058784Z 2025-05-07T20:32:55.5058889Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5059021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5059115Z x = x_sign * x_clamp 2025-05-07T20:32:55.5059209Z x0 = x[:, :D] 2025-05-07T20:32:55.5059294Z x1 = x[:, D:] 2025-05-07T20:32:55.5059372Z 2025-05-07T20:32:55.5059468Z if contiguous: 2025-05-07T20:32:55.5059566Z x0 = x0.contiguous() 2025-05-07T20:32:55.5059664Z x1 = x1.contiguous() 2025-05-07T20:32:55.5059750Z 2025-05-07T20:32:55.5059846Z if scale_ub is not None: 2025-05-07T20:32:55.5059967Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5060111Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5060192Z ) 2025-05-07T20:32:55.5060276Z else: 2025-05-07T20:32:55.5060375Z scale_ub_tensor = None 2025-05-07T20:32:55.5060456Z 2025-05-07T20:32:55.5060599Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5060695Z op = silu_mul_quant 2025-05-07T20:32:55.5060785Z if compiled: 2025-05-07T20:32:55.5060900Z op = torch.compile(op) 2025-05-07T20:32:55.5061012Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5061091Z 2025-05-07T20:32:55.5061194Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5061198Z 2025-05-07T20:32:55.5061302Z moe/activation_test.py:117: 2025-05-07T20:32:55.5061449Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5061558Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5061748Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5062147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.5062246Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.5062770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5062879Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5063256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5063497Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5063856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5063998Z kernel = self.compile( 2025-05-07T20:32:55.5064412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5064599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5064742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5064786Z 2025-05-07T20:32:55.5065007Z self = 2025-05-07T20:32:55.5065815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5066353Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71a97e0>} 2025-05-07T20:32:55.5067163Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5067399Z context = 2025-05-07T20:32:55.5067403Z 2025-05-07T20:32:55.5067578Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5067858Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5067981Z module_map=module_map) 2025-05-07T20:32:55.5068154Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5068265Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5068352Z E ^ 2025-05-07T20:32:55.5068726Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5068733Z 2025-05-07T20:32:55.5069171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5069179Z 2025-05-07T20:32:55.5069289Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5069530Z self=, 2025-05-07T20:32:55.5069615Z T=16384, 2025-05-07T20:32:55.5069698Z D=5120, 2025-05-07T20:32:55.5069793Z scale_ub=None, 2025-05-07T20:32:55.5069886Z contiguous=False, 2025-05-07T20:32:55.5069976Z compiled=True, 2025-05-07T20:32:55.5070060Z ) 2025-05-07T20:32:55.5070290Z self = 2025-05-07T20:32:55.5070478Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.5070482Z 2025-05-07T20:32:55.5070573Z @given( 2025-05-07T20:32:55.5070699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5070811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5070941Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5071149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5071279Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5071359Z ) 2025-05-07T20:32:55.5071621Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5071734Z def test_silu_mul_quant( 2025-05-07T20:32:55.5071814Z self, 2025-05-07T20:32:55.5071896Z T: int, 2025-05-07T20:32:55.5071986Z D: int, 2025-05-07T20:32:55.5072090Z scale_ub: Optional[float], 2025-05-07T20:32:55.5072185Z contiguous: bool, 2025-05-07T20:32:55.5072284Z compiled: bool, 2025-05-07T20:32:55.5072367Z ) -> None: 2025-05-07T20:32:55.5072467Z torch.manual_seed(2025) 2025-05-07T20:32:55.5072549Z 2025-05-07T20:32:55.5072726Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5072859Z 2025-05-07T20:32:55.5072957Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5073097Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5073197Z x = x_sign * x_clamp 2025-05-07T20:32:55.5073282Z x0 = x[:, :D] 2025-05-07T20:32:55.5073366Z x1 = x[:, D:] 2025-05-07T20:32:55.5073452Z 2025-05-07T20:32:55.5073685Z if contiguous: 2025-05-07T20:32:55.5073783Z x0 = x0.contiguous() 2025-05-07T20:32:55.5073884Z x1 = x1.contiguous() 2025-05-07T20:32:55.5073961Z 2025-05-07T20:32:55.5074057Z if scale_ub is not None: 2025-05-07T20:32:55.5074174Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5074317Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5074405Z ) 2025-05-07T20:32:55.5074487Z else: 2025-05-07T20:32:55.5074587Z scale_ub_tensor = None 2025-05-07T20:32:55.5074674Z 2025-05-07T20:32:55.5074811Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5074908Z op = silu_mul_quant 2025-05-07T20:32:55.5075010Z if compiled: 2025-05-07T20:32:55.5075116Z op = torch.compile(op) 2025-05-07T20:32:55.5075227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5075311Z 2025-05-07T20:32:55.5075405Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5075413Z 2025-05-07T20:32:55.5075516Z moe/activation_test.py:117: 2025-05-07T20:32:55.5075658Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5075765Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5075875Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5076261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.5076360Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.5076889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5076995Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5077375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5077615Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5077977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5078081Z kernel = self.compile( 2025-05-07T20:32:55.5078482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5078668Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5078807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5078814Z 2025-05-07T20:32:55.5079031Z self = 2025-05-07T20:32:55.5079974Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5080508Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71aa680>} 2025-05-07T20:32:55.5081294Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5081500Z context = 2025-05-07T20:32:55.5081505Z 2025-05-07T20:32:55.5081680Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5082006Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5082127Z module_map=module_map) 2025-05-07T20:32:55.5082299Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5082412Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5082494Z E ^ 2025-05-07T20:32:55.5082918Z E ValueError("type fp8e4nv not supported in this architecture. 
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tried ten more examples, and every one raised the identical CompilationError while JIT-compiling _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80). The failure also reproduces with compiled=False (those tracebacks simply lack the torch/_dynamo/eval_frame.py frame), so torch.compile is not a factor:

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)
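Every compilation failure above is the same architecture gate: the error's own list of supported dtypes ('fp8e4b15', 'fp8e5') is what Triton offers on pre-Ada NVIDIA GPUs, while fp8e4nv (FP8 E4M3) generally requires compute capability 8.9 or newer; the 22.07 GiB device reported later in this log is consistent with an SM 8.6 part such as the A10G. Below is a minimal sketch of a capability guard that would skip these examples on such GPUs; the helper name, example class, and skip message are illustrative assumptions, not code from the FBGEMM suite.

import unittest

import torch


def cuda_supports_fp8_e4m3() -> bool:
    """True when the active GPU can compile Triton's fp8e4nv dtype (SM 8.9+)."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


class _Fp8GuardExample(unittest.TestCase):
    # Hypothetical placement: the real suite would decorate test_silu_mul_quant.
    @unittest.skipUnless(
        cuda_supports_fp8_e4m3(),
        "Triton fp8e4nv needs SM 8.9+; this GPU only offers fp8e4b15/fp8e5",
    )
    def test_requires_fp8(self) -> None:
        ...

With a guard like this the job would report the cases as skipped on SM 8.x runners instead of failing the whole Hypothesis run.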
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5229375Z 2025-05-07T20:32:55.5229807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5229812Z 2025-05-07T20:32:55.5229927Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5230161Z self=, 2025-05-07T20:32:55.5230245Z T=16384, 2025-05-07T20:32:55.5230332Z D=5120, 2025-05-07T20:32:55.5230422Z scale_ub=None, 2025-05-07T20:32:55.5230524Z contiguous=False, 2025-05-07T20:32:55.5230614Z compiled=False, 2025-05-07T20:32:55.5230694Z ) 2025-05-07T20:32:55.5230932Z self = 2025-05-07T20:32:55.5231119Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.5231124Z 2025-05-07T20:32:55.5231206Z @given( 2025-05-07T20:32:55.5231340Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5231448Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5231570Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5231702Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5231823Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5231910Z ) 2025-05-07T20:32:55.5232172Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5232272Z def test_silu_mul_quant( 2025-05-07T20:32:55.5232363Z self, 2025-05-07T20:32:55.5232446Z T: int, 2025-05-07T20:32:55.5232528Z D: int, 2025-05-07T20:32:55.5232643Z scale_ub: Optional[float], 2025-05-07T20:32:55.5232866Z contiguous: bool, 2025-05-07T20:32:55.5232961Z compiled: bool, 2025-05-07T20:32:55.5233053Z ) -> None: 2025-05-07T20:32:55.5233153Z torch.manual_seed(2025) 2025-05-07T20:32:55.5233234Z 2025-05-07T20:32:55.5233417Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5233570Z 2025-05-07T20:32:55.5233676Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5233810Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5235707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
2025-05-07T20:32:55.5236050Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:55.5239697Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:55.5241558Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5241693Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:55.5241703Z
2025-05-07T20:32:55.5241896Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:55.5248302Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5250221Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5250360Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5250364Z
2025-05-07T20:32:55.5250478Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:55.5259703Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:55.5261586Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5261764Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:55.5261768Z
2025-05-07T20:32:55.5261926Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5265558Z >       x_sign = torch.sign(x)
2025-05-07T20:32:55.5267484Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5267619Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:55.5267625Z
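Note: reading these failures in sequence, each example dies one statement earlier than the last (x_clamp, then x_sign, then the initial torch.randn) while the reported free memory shrinks from 140 MiB toward 26 MiB, which suggests memory from earlier examples is still reachable between runs, for instance via exception tracebacks that Hypothesis retains. With unittest-style tests, setUp/tearDown run once around all of a @given test's examples, not once per example, so per-example cleanup has to live in the test body itself. A sketch, assuming cleanup at the top of each example is acceptable here:

    import gc
    import torch

    def _reclaim_cuda_memory() -> None:
        # Drop unreachable references (including those held by stored
        # tracebacks), then return cached-but-unused blocks to the driver.
        # Live tensors are unaffected.
        gc.collect()
        torch.cuda.synchronize()
        torch.cuda.empty_cache()

    # Hypothetical usage: call as the first statement inside
    # test_silu_mul_quant's body, before the torch.randn allocation.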
2025-05-07T20:32:55.5267741Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:55.5269055Z     @given(
2025-05-07T20:32:55.5269184Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:55.5269291Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:55.5269419Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:55.5269543Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:55.5269706Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:55.5269792Z     )
2025-05-07T20:32:55.5270056Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:55.5270156Z     def test_silu_mul_quant(
2025-05-07T20:32:55.5270245Z         self,
2025-05-07T20:32:55.5270327Z         T: int,
2025-05-07T20:32:55.5270408Z         D: int,
2025-05-07T20:32:55.5270559Z         scale_ub: Optional[float],
2025-05-07T20:32:55.5270714Z         contiguous: bool,
2025-05-07T20:32:55.5270807Z         compiled: bool,
2025-05-07T20:32:55.5270898Z     ) -> None:
2025-05-07T20:32:55.5270998Z         torch.manual_seed(2025)
2025-05-07T20:32:55.5271082Z
2025-05-07T20:32:55.5271259Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5271338Z
2025-05-07T20:32:55.5271443Z         x_sign = torch.sign(x)
2025-05-07T20:32:55.5271575Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:55.5271674Z         x = x_sign * x_clamp
2025-05-07T20:32:55.5271765Z         x0 = x[:, :D]
2025-05-07T20:32:55.5271851Z         x1 = x[:, D:]
2025-05-07T20:32:55.5271929Z
2025-05-07T20:32:55.5272027Z         if contiguous:
2025-05-07T20:32:55.5272127Z             x0 = x0.contiguous()
2025-05-07T20:32:55.5272222Z             x1 = x1.contiguous()
2025-05-07T20:32:55.5272306Z
2025-05-07T20:32:55.5272405Z         if scale_ub is not None:
2025-05-07T20:32:55.5272527Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:55.5272670Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:55.5272752Z             )
2025-05-07T20:32:55.5272843Z         else:
2025-05-07T20:32:55.5272943Z             scale_ub_tensor = None
2025-05-07T20:32:55.5273022Z
2025-05-07T20:32:55.5273167Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:55.5273264Z             op = silu_mul_quant
2025-05-07T20:32:55.5273355Z             if compiled:
2025-05-07T20:32:55.5273470Z                 op = torch.compile(op)
2025-05-07T20:32:55.5273773Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:55.5273851Z
2025-05-07T20:32:55.5273959Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:55.5273963Z
2025-05-07T20:32:55.5274068Z moe/activation_test.py:117:
2025-05-07T20:32:55.5274213Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:55.5274325Z moe/activation_test.py:115: in fn
2025-05-07T20:32:55.5274436Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:55.5274975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:55.5275079Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:55.5280476Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.5280593Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.5280678Z E       ^
2025-05-07T20:32:55.5281059Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.5281063Z
2025-05-07T20:32:55.5281498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.5281503Z
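Note: the listing above is the complete test body, so the contract of the op under test is visible: silu_mul_quant(x0, x1, scale_ub) consumes two [T, D] bf16 halves of one [T, 2*D] input and returns an fp8 tensor plus a scale. As a point of reference only, a plain-PyTorch sketch of plausible semantics; the real fbgemm kernel's scaling granularity and clamping behavior are assumptions here, not taken from this log:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # silu(x0) * x1, then quantize to FP8 E4M3 with a row-wise scale.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        amax = y.abs().amax(dim=-1, keepdim=True)
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scale = amax.clamp(min=1e-12) / fp8_max
        y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, scale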
2025-05-07T20:32:55.5281614Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5287644Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:55.5293936Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.5294044Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.5294134Z E       ^
2025-05-07T20:32:55.5294508Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.5294951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.5294962Z
2025-05-07T20:32:55.5295072Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5301010Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:55.5307436Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.5307549Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.5307630Z E       ^
2025-05-07T20:32:55.5308007Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.5308443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.5308450Z
2025-05-07T20:32:55.5308560Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:55.5311970Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5313916Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5314184Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5314189Z
2025-05-07T20:32:55.5314301Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:55.5320249Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:55.5327153Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.5327261Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.5327351Z E       ^
2025-05-07T20:32:55.5327723Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.5328157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.5328161Z
2025-05-07T20:32:55.5328276Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5332135Z >       x_sign = torch.sign(x)
2025-05-07T20:32:55.5333977Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5334121Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:55.5334128Z
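Note: the "Tried to allocate" sizes are exactly one [T, 2*D] bf16 tensor (2 bytes per element). Every tensor in the failing region (x, abs(x), sign, clamp) has that same shape, so the requested size identifies the example's shape rather than the statement. A quick check that reproduces the sizes seen in this log:

    # 2 bytes per bf16 element; prints 40, 56, 80, 112, 320, 448 MiB,
    # matching the OutOfMemoryError messages above.
    for T, D in [(2048, 5120), (2048, 7168), (4096, 5120),
                 (4096, 7168), (16384, 5120), (16384, 7168)]:
        print(f"T={T:5d} D={D}: {T * 2 * D * 2 / 2**20:6.2f} MiB")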
2025-05-07T20:32:55.5334236Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5337672Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5339519Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5339691Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5339696Z
2025-05-07T20:32:55.5339809Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5343209Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5345033Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5345177Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5345181Z
2025-05-07T20:32:55.5345288Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:55.5348756Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5350682Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5350815Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5350820Z
2025-05-07T20:32:55.5350937Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:55.5354380Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5356303Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5356448Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5356453Z
2025-05-07T20:32:55.5356561Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:55.5360043Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5361896Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5362028Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5362035Z
2025-05-07T20:32:55.5362149Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:55.5365517Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5367449Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5367623Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5367627Z
2025-05-07T20:32:55.5367735Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:55.5371079Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5372931Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5373109Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5373113Z
2025-05-07T20:32:55.5373230Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5376691Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5378589Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5378727Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5378731Z
2025-05-07T20:32:55.5378839Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5382734Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5384587Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5384766Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5384771Z
2025-05-07T20:32:55.5384885Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:55.5394023Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5395912Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5396048Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5396052Z
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.5395921Z 2025-05-07T20:32:55.5396048Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:55.5396052Z 2025-05-07T20:32:55.5396167Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5396405Z self=, 2025-05-07T20:32:55.5396488Z T=128, 2025-05-07T20:32:55.5396575Z D=5120, 2025-05-07T20:32:55.5396748Z scale_ub=1200.0, 2025-05-07T20:32:55.5396844Z contiguous=False, 2025-05-07T20:32:55.5396940Z compiled=False, 2025-05-07T20:32:55.5397020Z ) 2025-05-07T20:32:55.5397251Z self = 2025-05-07T20:32:55.5397443Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.5397448Z 2025-05-07T20:32:55.5397532Z @given( 2025-05-07T20:32:55.5397665Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5397772Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5397895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5398027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5398149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5398276Z ) 2025-05-07T20:32:55.5398546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5398649Z def test_silu_mul_quant( 2025-05-07T20:32:55.5398746Z self, 2025-05-07T20:32:55.5398831Z T: int, 2025-05-07T20:32:55.5398916Z D: int, 2025-05-07T20:32:55.5399031Z scale_ub: Optional[float], 2025-05-07T20:32:55.5399170Z contiguous: bool, 2025-05-07T20:32:55.5399262Z compiled: bool, 2025-05-07T20:32:55.5399400Z ) -> None: 2025-05-07T20:32:55.5399503Z torch.manual_seed(2025) 2025-05-07T20:32:55.5399581Z 2025-05-07T20:32:55.5399766Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5399851Z 2025-05-07T20:32:55.5399949Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5400095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5400190Z x = x_sign * x_clamp 2025-05-07T20:32:55.5400280Z x0 = x[:, :D] 2025-05-07T20:32:55.5400377Z x1 = x[:, D:] 2025-05-07T20:32:55.5400456Z 2025-05-07T20:32:55.5400552Z if contiguous: 2025-05-07T20:32:55.5400654Z x0 = x0.contiguous() 2025-05-07T20:32:55.5400755Z x1 = x1.contiguous() 2025-05-07T20:32:55.5400839Z 2025-05-07T20:32:55.5400937Z if scale_ub is not None: 2025-05-07T20:32:55.5401051Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5401209Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5401292Z ) 2025-05-07T20:32:55.5401374Z else: 2025-05-07T20:32:55.5401480Z scale_ub_tensor = None 2025-05-07T20:32:55.5401559Z 2025-05-07T20:32:55.5401697Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5401799Z op = silu_mul_quant 2025-05-07T20:32:55.5401890Z if compiled: 2025-05-07T20:32:55.5402002Z op = torch.compile(op) 2025-05-07T20:32:55.5402118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5402196Z 2025-05-07T20:32:55.5402301Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5402305Z 2025-05-07T20:32:55.5402411Z moe/activation_test.py:117: 2025-05-07T20:32:55.5402550Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5402667Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5402777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5403314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5403427Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5403808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5404051Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5404411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5404514Z kernel = self.compile( 2025-05-07T20:32:55.5404978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5405168Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5405312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5405319Z 2025-05-07T20:32:55.5405540Z self = 2025-05-07T20:32:55.5406359Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5406899Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6948940>} 2025-05-07T20:32:55.5407725Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5407936Z context = 2025-05-07T20:32:55.5407983Z 2025-05-07T20:32:55.5408227Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5408507Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5408629Z module_map=module_map) 2025-05-07T20:32:55.5408803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5408913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5408996Z E ^ 2025-05-07T20:32:55.5409370Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5409378Z 2025-05-07T20:32:55.5409821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5409826Z 2025-05-07T20:32:55.5409938Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5410178Z self=, 2025-05-07T20:32:55.5410265Z T=2048, 2025-05-07T20:32:55.5410347Z D=7168, 2025-05-07T20:32:55.5410446Z scale_ub=None, 2025-05-07T20:32:55.5410538Z contiguous=False, 2025-05-07T20:32:55.5410628Z compiled=False, 2025-05-07T20:32:55.5410713Z ) 2025-05-07T20:32:55.5410944Z self = 2025-05-07T20:32:55.5411128Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.5411132Z 2025-05-07T20:32:55.5411219Z @given( 2025-05-07T20:32:55.5411346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5411461Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5411584Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5411711Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5411839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5411920Z ) 2025-05-07T20:32:55.5412180Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5412297Z def test_silu_mul_quant( 2025-05-07T20:32:55.5412379Z self, 2025-05-07T20:32:55.5412461Z T: int, 2025-05-07T20:32:55.5412549Z D: int, 2025-05-07T20:32:55.5412652Z scale_ub: Optional[float], 2025-05-07T20:32:55.5412747Z contiguous: bool, 2025-05-07T20:32:55.5412845Z compiled: bool, 2025-05-07T20:32:55.5412928Z ) -> None: 2025-05-07T20:32:55.5413034Z torch.manual_seed(2025) 2025-05-07T20:32:55.5413114Z 2025-05-07T20:32:55.5413296Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5415204Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.5415212Z 2025-05-07T20:32:55.5415344Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:55.5415348Z 2025-05-07T20:32:55.5415469Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5415703Z self=, 2025-05-07T20:32:55.5415827Z T=128, 2025-05-07T20:32:55.5415919Z D=7168, 2025-05-07T20:32:55.5416009Z scale_ub=1200.0, 2025-05-07T20:32:55.5416099Z contiguous=True, 2025-05-07T20:32:55.5416193Z compiled=True, 2025-05-07T20:32:55.5416274Z ) 2025-05-07T20:32:55.5416507Z self = 2025-05-07T20:32:55.5416688Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.5416734Z 2025-05-07T20:32:55.5416820Z @given( 2025-05-07T20:32:55.5416990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5417099Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5417223Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5417354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5417476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5417558Z ) 2025-05-07T20:32:55.5417828Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5417933Z def test_silu_mul_quant( 2025-05-07T20:32:55.5418023Z self, 2025-05-07T20:32:55.5418107Z T: int, 2025-05-07T20:32:55.5418192Z D: int, 2025-05-07T20:32:55.5418304Z scale_ub: Optional[float], 2025-05-07T20:32:55.5418403Z contiguous: bool, 2025-05-07T20:32:55.5418496Z compiled: bool, 2025-05-07T20:32:55.5418588Z ) -> None: 2025-05-07T20:32:55.5418693Z torch.manual_seed(2025) 2025-05-07T20:32:55.5418773Z 2025-05-07T20:32:55.5418959Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5419039Z 2025-05-07T20:32:55.5419136Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5419277Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5419373Z x = x_sign * x_clamp 2025-05-07T20:32:55.5419464Z x0 = x[:, :D] 2025-05-07T20:32:55.5419550Z x1 = x[:, D:] 2025-05-07T20:32:55.5419627Z 2025-05-07T20:32:55.5419727Z if contiguous: 2025-05-07T20:32:55.5419825Z x0 = x0.contiguous() 2025-05-07T20:32:55.5419919Z x1 = x1.contiguous() 2025-05-07T20:32:55.5420004Z 2025-05-07T20:32:55.5420104Z if scale_ub is not None: 2025-05-07T20:32:55.5420222Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5420374Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5420457Z ) 2025-05-07T20:32:55.5420539Z else: 2025-05-07T20:32:55.5420649Z scale_ub_tensor = None 2025-05-07T20:32:55.5420729Z 2025-05-07T20:32:55.5420867Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5420975Z op = silu_mul_quant 2025-05-07T20:32:55.5421067Z if compiled: 2025-05-07T20:32:55.5421178Z op = torch.compile(op) 2025-05-07T20:32:55.5421293Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5421371Z 2025-05-07T20:32:55.5421473Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5421480Z 2025-05-07T20:32:55.5421584Z moe/activation_test.py:117: 2025-05-07T20:32:55.5421721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5421883Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5421992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5422388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.5422492Z return fn(*args, **kwargs) 
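Every OutOfMemoryError above ends with the same allocator hint. A minimal sketch of applying it, with a hypothetical helper for releasing cached blocks between Hypothesis examples; only the PYTORCH_CUDA_ALLOC_CONF value comes from the messages above, the rest is illustrative:

    # Sketch only: the env var must be set before the first CUDA allocation
    # for the caching allocator to honor it.
    import gc
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so the allocator picks it up

    def free_cached_blocks() -> None:
        # Hypothetical helper: drop dead Python references, then return
        # cached CUDA blocks to the driver between test examples.
        gc.collect()
        torch.cuda.empty_cache()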
2025-05-07T20:32:55.5423010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5423120Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5423498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5423739Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5425188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5425325Z kernel = self.compile( 2025-05-07T20:32:55.5425871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5426115Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5426581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5426587Z 2025-05-07T20:32:55.5426821Z self = 2025-05-07T20:32:55.5427635Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5428179Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6948dc0>} 2025-05-07T20:32:55.5428957Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5429160Z context = 2025-05-07T20:32:55.5429176Z 2025-05-07T20:32:55.5429354Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5429630Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5429750Z module_map=module_map) 2025-05-07T20:32:55.5429923Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5430028Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5430116Z E ^ 2025-05-07T20:32:55.5430494Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5430499Z 2025-05-07T20:32:55.5430943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5430947Z 2025-05-07T20:32:55.5431058Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5431295Z self=, 2025-05-07T20:32:55.5431385Z T=128, 2025-05-07T20:32:55.5431468Z D=7168, 2025-05-07T20:32:55.5431558Z scale_ub=1200.0, 2025-05-07T20:32:55.5431654Z contiguous=True, 2025-05-07T20:32:55.5431744Z compiled=False, 2025-05-07T20:32:55.5431827Z ) 2025-05-07T20:32:55.5432061Z self = 2025-05-07T20:32:55.5432239Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:55.5432243Z 2025-05-07T20:32:55.5432337Z @given( 2025-05-07T20:32:55.5432463Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5432571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5432784Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5432914Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5433036Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5433126Z ) 2025-05-07T20:32:55.5433387Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5433488Z def test_silu_mul_quant( 2025-05-07T20:32:55.5433734Z self, 2025-05-07T20:32:55.5433818Z T: int, 2025-05-07T20:32:55.5433906Z D: int, 2025-05-07T20:32:55.5434009Z scale_ub: Optional[float], 2025-05-07T20:32:55.5434104Z contiguous: bool, 2025-05-07T20:32:55.5434202Z compiled: bool, 2025-05-07T20:32:55.5434287Z ) -> None: 2025-05-07T20:32:55.5434387Z torch.manual_seed(2025) 2025-05-07T20:32:55.5434548Z 2025-05-07T20:32:55.5434727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5434804Z 2025-05-07T20:32:55.5434913Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5435044Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5436996Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.5437044Z 2025-05-07T20:32:55.5437175Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:55.5437182Z 2025-05-07T20:32:55.5437296Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5437530Z self=, 2025-05-07T20:32:55.5437612Z T=128, 2025-05-07T20:32:55.5437703Z D=5120, 2025-05-07T20:32:55.5437791Z scale_ub=1200.0, 2025-05-07T20:32:55.5438115Z contiguous=True, 2025-05-07T20:32:55.5438212Z compiled=True, 2025-05-07T20:32:55.5438295Z ) 2025-05-07T20:32:55.5438540Z self = 2025-05-07T20:32:55.5438729Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.5438734Z 2025-05-07T20:32:55.5438817Z @given( 2025-05-07T20:32:55.5438943Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5439088Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5439211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5439341Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5439464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5439544Z ) 2025-05-07T20:32:55.5439811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5439912Z def test_silu_mul_quant( 2025-05-07T20:32:55.5439995Z self, 2025-05-07T20:32:55.5440082Z T: int, 2025-05-07T20:32:55.5440165Z D: int, 2025-05-07T20:32:55.5440279Z scale_ub: Optional[float], 2025-05-07T20:32:55.5440376Z contiguous: bool, 2025-05-07T20:32:55.5440470Z compiled: bool, 2025-05-07T20:32:55.5440562Z ) -> None: 2025-05-07T20:32:55.5440664Z torch.manual_seed(2025) 2025-05-07T20:32:55.5440758Z 2025-05-07T20:32:55.5440934Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5441013Z 2025-05-07T20:32:55.5441116Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5441247Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5443152Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.5443161Z 2025-05-07T20:32:55.5443287Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:55.5443291Z 2025-05-07T20:32:55.5443407Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5443645Z self=, 2025-05-07T20:32:55.5443730Z T=128, 2025-05-07T20:32:55.5443818Z D=7168, 2025-05-07T20:32:55.5443954Z scale_ub=None, 2025-05-07T20:32:55.5444047Z contiguous=True, 2025-05-07T20:32:55.5444145Z compiled=True, 2025-05-07T20:32:55.5444225Z ) 2025-05-07T20:32:55.5444454Z self = 2025-05-07T20:32:55.5444636Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:55.5444640Z 2025-05-07T20:32:55.5444763Z @given( 2025-05-07T20:32:55.5444931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5445039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5445161Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5445292Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5445413Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5445493Z ) 2025-05-07T20:32:55.5445758Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5445863Z def test_silu_mul_quant( 2025-05-07T20:32:55.5445944Z self, 2025-05-07T20:32:55.5446033Z T: int, 2025-05-07T20:32:55.5446114Z D: int, 2025-05-07T20:32:55.5446226Z scale_ub: Optional[float], 2025-05-07T20:32:55.5446322Z contiguous: bool, 2025-05-07T20:32:55.5446415Z compiled: bool, 2025-05-07T20:32:55.5446504Z ) -> None: 2025-05-07T20:32:55.5446603Z torch.manual_seed(2025) 2025-05-07T20:32:55.5446688Z 2025-05-07T20:32:55.5446873Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5448705Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.5448713Z 2025-05-07T20:32:55.5448849Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:55.5448991Z =============================== warnings summary =============================== 2025-05-07T20:32:55.5449313Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:55.5449641Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:55.5449952Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:55.5450869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:55.5451114Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:55.5451119Z 2025-05-07T20:32:55.5451391Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:55.5451571Z ================= 1 failed, 1 deselected, 3 warnings in 22.74s ================= 2025-05-07T20:32:57.2134191Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:57.2771613Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:57.2771869Z 2025-05-07T20:32:59.2790637Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:01.5904003Z ============================= test session starts ============================== 2025-05-07T20:33:01.5905158Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:01.5905746Z cachedir: .pytest_cache 2025-05-07T20:33:01.5906392Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:01.5907396Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:01.5907851Z plugins: hypothesis-6.131.14 2025-05-07T20:33:03.2442745Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:03.4268285Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:03.4268899Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:03.4269205Z 2025-05-07T20:33:06.0110178Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.0111054Z self=, 2025-05-07T20:33:06.0111536Z T=1, 2025-05-07T20:33:06.0111759Z D=5120, 2025-05-07T20:33:06.0111987Z scale_ub=None, 2025-05-07T20:33:06.0112242Z contiguous=True, 2025-05-07T20:33:06.0112507Z compiled=True, 2025-05-07T20:33:06.0112748Z ) 2025-05-07T20:33:06.0113123Z self = 2025-05-07T20:33:06.0113818Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:06.0114118Z 2025-05-07T20:33:06.0114219Z @given( 2025-05-07T20:33:06.0114490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.0114855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.0115211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.0115590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.0115979Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.0116316Z ) 2025-05-07T20:33:06.0116723Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.0117286Z def test_silu_mul_quant( 2025-05-07T20:33:06.0117690Z self, 2025-05-07T20:33:06.0117939Z T: int, 2025-05-07T20:33:06.0118167Z D: int, 2025-05-07T20:33:06.0118424Z scale_ub: Optional[float], 2025-05-07T20:33:06.0118743Z contiguous: bool, 2025-05-07T20:33:06.0119020Z compiled: bool, 2025-05-07T20:33:06.0119287Z ) -> None: 2025-05-07T20:33:06.0119543Z torch.manual_seed(2025) 2025-05-07T20:33:06.0119825Z 2025-05-07T20:33:06.0120219Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.0120639Z 2025-05-07T20:33:06.0120864Z x_sign = torch.sign(x) 2025-05-07T20:33:06.0121203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:06.0121566Z x = x_sign * x_clamp 2025-05-07T20:33:06.0121843Z x0 = x[:, :D] 2025-05-07T20:33:06.0122104Z x1 = x[:, D:] 2025-05-07T20:33:06.0122360Z 2025-05-07T20:33:06.0122576Z if contiguous: 2025-05-07T20:33:06.0123232Z x0 = x0.contiguous() 2025-05-07T20:33:06.0123545Z x1 = x1.contiguous() 2025-05-07T20:33:06.0124168Z 2025-05-07T20:33:06.0124406Z if scale_ub is not None: 2025-05-07T20:33:06.0124732Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.0125129Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.0125483Z ) 2025-05-07T20:33:06.0125712Z else: 2025-05-07T20:33:06.0125958Z scale_ub_tensor = None 2025-05-07T20:33:06.0126245Z 2025-05-07T20:33:06.0126517Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.0126887Z op = silu_mul_quant 2025-05-07T20:33:06.0127173Z if compiled: 2025-05-07T20:33:06.0127463Z op = torch.compile(op) 2025-05-07T20:33:06.0127807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.0128226Z 2025-05-07T20:33:06.0128455Z y_fp8, y_scale = fn() 2025-05-07T20:33:06.0128790Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:06.0129120Z 2025-05-07T20:33:06.0129410Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.0129837Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:06.0130275Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:06.0130710Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:06.0131132Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.0131492Z 2025-05-07T20:33:06.0131722Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:06.0131951Z 2025-05-07T20:33:06.0132069Z moe/activation_test.py:126: 2025-05-07T20:33:06.0132413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.0132797Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:06.0133181Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.0134088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:06.0134949Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:06.0135571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.0136361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.0137147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:06.0137974Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:06.0138830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:06.0139694Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:06.0140535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:06.0141263Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:06.0141954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:06.0142556Z fn() 2025-05-07T20:33:06.0143148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:06.0143811Z self.fn.run( 
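The CompilationError repeated in these traces ("type fp8e4nv not supported in this architecture") means the runner's GPU exposes only ('fp8e4b15', 'fp8e5') to Triton. A minimal sketch of a capability gate for such tests; supports_fp8e4nv is a hypothetical helper, and the >= (8, 9) threshold is an assumption (Ada/Hopper-class hardware):

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv requires compute capability >= 8.9; the GPU
        # in this job reports a lower SM, hence the ValueError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch: @unittest.skipUnless(supports_fp8e4nv(), "needs fp8e4nv")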
2025-05-07T20:33:06.0144355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.0144964Z kernel = self.compile( 2025-05-07T20:33:06.0145579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.0146334Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.0146883Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.0147147Z 2025-05-07T20:33:06.0147392Z self = 2025-05-07T20:33:06.0148620Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.0150215Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb7bb8af0>} 2025-05-07T20:33:06.0151749Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.0152972Z context = 2025-05-07T20:33:06.0153303Z 2025-05-07T20:33:06.0153615Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.0154212Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.0154821Z module_map=module_map) 2025-05-07T20:33:06.0155287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.0155693Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:06.0156002Z E ^ 2025-05-07T20:33:06.0156536Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.0157048Z 2025-05-07T20:33:06.0157528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.0158111Z 2025-05-07T20:33:06.0158231Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.0158704Z self=, 2025-05-07T20:33:06.0159169Z T=2048, 2025-05-07T20:33:06.0159385Z D=5120, 2025-05-07T20:33:06.0159608Z scale_ub=1200.0, 2025-05-07T20:33:06.0159869Z contiguous=True, 2025-05-07T20:33:06.0160120Z compiled=False, 2025-05-07T20:33:06.0160369Z ) 2025-05-07T20:33:07.5325390Z self = 2025-05-07T20:33:07.5326539Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.5327098Z 2025-05-07T20:33:07.5327271Z @given( 2025-05-07T20:33:07.5327740Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5328385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5329009Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5329725Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5330357Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5330774Z ) 2025-05-07T20:33:07.5331272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5331781Z def test_silu_mul_quant( 2025-05-07T20:33:07.5332069Z self, 2025-05-07T20:33:07.5332301Z T: int, 2025-05-07T20:33:07.5332533Z D: int, 2025-05-07T20:33:07.5332791Z scale_ub: Optional[float], 2025-05-07T20:33:07.5333111Z contiguous: bool, 2025-05-07T20:33:07.5333385Z compiled: bool, 2025-05-07T20:33:07.5333651Z ) -> None: 2025-05-07T20:33:07.5333909Z torch.manual_seed(2025) 2025-05-07T20:33:07.5334185Z 2025-05-07T20:33:07.5334501Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5334898Z 
2025-05-07T20:33:07.5335120Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5335458Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5335820Z x = x_sign * x_clamp 2025-05-07T20:33:07.5336105Z x0 = x[:, :D] 2025-05-07T20:33:07.5336354Z x1 = x[:, D:] 2025-05-07T20:33:07.5336896Z 2025-05-07T20:33:07.5337122Z if contiguous: 2025-05-07T20:33:07.5337387Z x0 = x0.contiguous() 2025-05-07T20:33:07.5337685Z x1 = x1.contiguous() 2025-05-07T20:33:07.5337965Z 2025-05-07T20:33:07.5338191Z if scale_ub is not None: 2025-05-07T20:33:07.5338515Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5338904Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5339256Z ) 2025-05-07T20:33:07.5339483Z else: 2025-05-07T20:33:07.5339732Z scale_ub_tensor = None 2025-05-07T20:33:07.5340047Z 2025-05-07T20:33:07.5340357Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5340744Z op = silu_mul_quant 2025-05-07T20:33:07.5341028Z if compiled: 2025-05-07T20:33:07.5341425Z op = torch.compile(op) 2025-05-07T20:33:07.5341770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5342086Z 2025-05-07T20:33:07.5349638Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5349858Z 2025-05-07T20:33:07.5350006Z moe/activation_test.py:117: 2025-05-07T20:33:07.5350376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5350906Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5351321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5352122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5352921Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5353638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5354428Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5355190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5355812Z kernel = self.compile( 2025-05-07T20:33:07.5356442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5357201Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5357665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5357939Z 2025-05-07T20:33:07.5358179Z self = 2025-05-07T20:33:07.5359414Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5361043Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efdb7a95990>} 2025-05-07T20:33:07.5362582Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5363755Z context = 2025-05-07T20:33:07.5364090Z 2025-05-07T20:33:07.5364292Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5364892Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5365426Z module_map=module_map) 2025-05-07T20:33:07.5365851Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5366258Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5366558Z E ^ 2025-05-07T20:33:07.5367093Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5367606Z 2025-05-07T20:33:07.5368144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5368730Z 2025-05-07T20:33:07.5368863Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5369341Z self=, 2025-05-07T20:33:07.5369805Z T=2048, 2025-05-07T20:33:07.5370025Z D=5120, 2025-05-07T20:33:07.5370246Z scale_ub=1200.0, 2025-05-07T20:33:07.5370509Z contiguous=True, 2025-05-07T20:33:07.5370770Z compiled=True, 2025-05-07T20:33:07.5371007Z ) 2025-05-07T20:33:07.5371377Z self = 2025-05-07T20:33:07.5371949Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.5372307Z 2025-05-07T20:33:07.5372408Z @given( 2025-05-07T20:33:07.5372671Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5373035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5373393Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5373773Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5374158Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5374542Z ) 2025-05-07T20:33:07.5374986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5375502Z def test_silu_mul_quant( 2025-05-07T20:33:07.5375792Z self, 2025-05-07T20:33:07.5376017Z T: int, 2025-05-07T20:33:07.5376254Z D: int, 2025-05-07T20:33:07.5376514Z scale_ub: Optional[float], 2025-05-07T20:33:07.5376825Z contiguous: bool, 2025-05-07T20:33:07.5377109Z compiled: bool, 2025-05-07T20:33:07.5377372Z ) -> None: 2025-05-07T20:33:07.5377622Z torch.manual_seed(2025) 2025-05-07T20:33:07.5377910Z 2025-05-07T20:33:07.5378230Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5378630Z 2025-05-07T20:33:07.5378856Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5379197Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5379556Z x = x_sign * x_clamp 2025-05-07T20:33:07.5379839Z x0 = x[:, :D] 2025-05-07T20:33:07.5380120Z x1 = x[:, D:] 2025-05-07T20:33:07.5380387Z 2025-05-07T20:33:07.5380600Z if contiguous: 2025-05-07T20:33:07.5380874Z x0 = x0.contiguous() 2025-05-07T20:33:07.5381174Z x1 = x1.contiguous() 2025-05-07T20:33:07.5381449Z 2025-05-07T20:33:07.5381675Z if scale_ub is not None: 2025-05-07T20:33:07.5381993Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5382375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5382737Z ) 2025-05-07T20:33:07.5382966Z else: 2025-05-07T20:33:07.5383208Z scale_ub_tensor = None 2025-05-07T20:33:07.5383501Z 2025-05-07T20:33:07.5383774Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5384138Z op = silu_mul_quant 2025-05-07T20:33:07.5384424Z if compiled: 
2025-05-07T20:33:07.5384711Z op = torch.compile(op) 2025-05-07T20:33:07.5385058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5385374Z 2025-05-07T20:33:07.5385598Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.5385931Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.5386263Z 2025-05-07T20:33:07.5386543Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5386932Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.5387263Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.5387633Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.5388053Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5388416Z 2025-05-07T20:33:07.5388703Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.5388936Z 2025-05-07T20:33:07.5389054Z moe/activation_test.py:126: 2025-05-07T20:33:07.5389403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5389792Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.5390224Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5391131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.5391997Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5392623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5393407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5394322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.5395152Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5396022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.5396976Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5397816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.5398546Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5399240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.5399843Z fn() 2025-05-07T20:33:07.5400461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.5401119Z self.fn.run( 2025-05-07T20:33:07.5401659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5402261Z kernel = self.compile( 2025-05-07T20:33:07.5402873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5403626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5404077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5404340Z 2025-05-07T20:33:07.5404583Z self = 2025-05-07T20:33:07.5405799Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:33:07.5407365Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb65256c0>} 2025-05-07T20:33:07.5408895Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5410073Z context = 2025-05-07T20:33:07.5410445Z 2025-05-07T20:33:07.5410635Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5411232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5411770Z module_map=module_map) 2025-05-07T20:33:07.5412187Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5412595Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5412905Z E ^ 2025-05-07T20:33:07.5413493Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5414004Z 2025-05-07T20:33:07.5414480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5415064Z 2025-05-07T20:33:07.5415186Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5415659Z self=, 2025-05-07T20:33:07.5416119Z T=16384, 2025-05-07T20:33:07.5416339Z D=7168, 2025-05-07T20:33:07.5416566Z scale_ub=1200.0, 2025-05-07T20:33:07.5416829Z contiguous=False, 2025-05-07T20:33:07.5417082Z compiled=False, 2025-05-07T20:33:07.5417324Z ) 2025-05-07T20:33:08.8505334Z self = 2025-05-07T20:33:08.8506312Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.8506739Z 2025-05-07T20:33:08.8506866Z @given( 2025-05-07T20:33:08.8507207Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.8507607Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.8507956Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.8508420Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.8508852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.8509180Z ) 2025-05-07T20:33:08.8509574Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.8510064Z def test_silu_mul_quant( 2025-05-07T20:33:08.8510358Z self, 2025-05-07T20:33:08.8510606Z T: int, 2025-05-07T20:33:08.8510820Z D: int, 2025-05-07T20:33:08.8511065Z scale_ub: Optional[float], 2025-05-07T20:33:08.8511370Z contiguous: bool, 2025-05-07T20:33:08.8511634Z compiled: bool, 2025-05-07T20:33:08.8511886Z ) -> None: 2025-05-07T20:33:08.8512128Z torch.manual_seed(2025) 2025-05-07T20:33:08.8512396Z 2025-05-07T20:33:08.8512705Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.8513087Z 2025-05-07T20:33:08.8513296Z x_sign = torch.sign(x) 2025-05-07T20:33:08.8513695Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.8514043Z x = x_sign * x_clamp 2025-05-07T20:33:08.8514314Z x0 = x[:, :D] 2025-05-07T20:33:08.8514551Z x1 = x[:, D:] 2025-05-07T20:33:08.8514785Z 2025-05-07T20:33:08.8514997Z if contiguous: 2025-05-07T20:33:08.8515251Z x0 = x0.contiguous() 2025-05-07T20:33:08.8515538Z x1 = x1.contiguous() 2025-05-07T20:33:08.8515818Z 2025-05-07T20:33:08.8516027Z if scale_ub is not None: 2025-05-07T20:33:08.8516349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.8516728Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.8517068Z ) 2025-05-07T20:33:08.8517285Z else: 2025-05-07T20:33:08.8517528Z scale_ub_tensor = None 2025-05-07T20:33:08.8517818Z 2025-05-07T20:33:08.8518074Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:33:08.8518426Z op = silu_mul_quant 2025-05-07T20:33:08.8518714Z if compiled: 2025-05-07T20:33:08.8518991Z op = torch.compile(op) 2025-05-07T20:33:08.8519324Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8519633Z 2025-05-07T20:33:08.8519845Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.8520036Z 2025-05-07T20:33:08.8520149Z moe/activation_test.py:117: 2025-05-07T20:33:08.8520524Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8520897Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.8521214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8521989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.8522848Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.8523441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.8524613Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.8525354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.8525948Z kernel = self.compile( 2025-05-07T20:33:08.8526543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.8527271Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.8527713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8528057Z 2025-05-07T20:33:08.8528286Z self = 2025-05-07T20:33:08.8529491Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.8531142Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb65248b0>} 2025-05-07T20:33:08.8532632Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.8533763Z context = 2025-05-07T20:33:08.8534088Z 2025-05-07T20:33:08.8534272Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.8534849Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.8535368Z module_map=module_map) 2025-05-07T20:33:08.8535780Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.8536163Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.8536455Z E ^ 2025-05-07T20:33:08.8536974Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.8537467Z 2025-05-07T20:33:08.8537924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.8538492Z 2025-05-07T20:33:08.8538608Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8539071Z self=, 2025-05-07T20:33:08.8539516Z T=1, 2025-05-07T20:33:08.8539717Z D=7168, 2025-05-07T20:33:08.8539934Z scale_ub=None, 2025-05-07T20:33:08.8540176Z contiguous=True, 2025-05-07T20:33:08.8540421Z compiled=True, 2025-05-07T20:33:08.8540650Z ) 2025-05-07T20:33:08.8541003Z self = 2025-05-07T20:33:08.8541532Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:08.8541826Z 2025-05-07T20:33:08.8541918Z @given( 2025-05-07T20:33:08.8542180Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.8542520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.8542863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.8543231Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.8543598Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.8543910Z ) 2025-05-07T20:33:08.8544305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.8544799Z def test_silu_mul_quant( 2025-05-07T20:33:08.8545063Z self, 2025-05-07T20:33:08.8545282Z T: int, 2025-05-07T20:33:08.8545579Z D: int, 2025-05-07T20:33:08.8545819Z scale_ub: Optional[float], 2025-05-07T20:33:08.8546126Z contiguous: bool, 2025-05-07T20:33:08.8546397Z compiled: bool, 2025-05-07T20:33:08.8546647Z ) -> None: 2025-05-07T20:33:08.8546888Z torch.manual_seed(2025) 2025-05-07T20:33:08.8547159Z 2025-05-07T20:33:08.8547459Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.8547842Z 2025-05-07T20:33:08.8548062Z x_sign = torch.sign(x) 2025-05-07T20:33:08.8548380Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.8548725Z x = x_sign * x_clamp 2025-05-07T20:33:08.8548992Z x0 = x[:, :D] 2025-05-07T20:33:08.8549235Z x1 = x[:, D:] 2025-05-07T20:33:08.8549512Z 2025-05-07T20:33:08.8549724Z if contiguous: 2025-05-07T20:33:08.8549982Z x0 = x0.contiguous() 2025-05-07T20:33:08.8550277Z x1 = x1.contiguous() 2025-05-07T20:33:08.8550569Z 2025-05-07T20:33:08.8550827Z if scale_ub is not None: 2025-05-07T20:33:08.8551136Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.8551512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.8551908Z ) 2025-05-07T20:33:08.8552159Z else: 2025-05-07T20:33:08.8552404Z scale_ub_tensor = None 2025-05-07T20:33:08.8552689Z 2025-05-07T20:33:08.8552941Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.8553293Z op = silu_mul_quant 2025-05-07T20:33:08.8553651Z if compiled: 2025-05-07T20:33:08.8553925Z op = torch.compile(op) 2025-05-07T20:33:08.8554255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8554565Z 2025-05-07T20:33:08.8554784Z y_fp8, y_scale = fn() 2025-05-07T20:33:08.8555101Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:08.8555427Z 2025-05-07T20:33:08.8555696Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.8556062Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:08.8556388Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:08.8556739Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:08.8557137Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.8557487Z 2025-05-07T20:33:08.8557713Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:08.8557929Z 2025-05-07T20:33:08.8558043Z moe/activation_test.py:126: 2025-05-07T20:33:08.8558376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8558755Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:08.8559130Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.8560008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:08.8560853Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.8561466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.8562230Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.8562990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:08.8563791Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.8564627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:08.8565452Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.8566266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:08.8567028Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.8567694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:08.8568263Z fn() 2025-05-07T20:33:08.8568835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:08.8569474Z self.fn.run( 2025-05-07T20:33:08.8569987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.8570625Z kernel = self.compile( 2025-05-07T20:33:08.8571219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.8571939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.8572423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8572684Z 2025-05-07T20:33:08.8572918Z self = 2025-05-07T20:33:08.8574149Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.8575737Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efdb62c4e50>} 2025-05-07T20:33:08.8577222Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.8578348Z context = 2025-05-07T20:33:08.8578671Z 2025-05-07T20:33:08.8578856Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.8579436Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.8579953Z module_map=module_map) 2025-05-07T20:33:08.8580363Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.8580764Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:08.8581060Z E ^ 2025-05-07T20:33:08.8581569Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.8582070Z 2025-05-07T20:33:08.8582528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.8583092Z 2025-05-07T20:33:08.8583218Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8583679Z self=, 2025-05-07T20:33:08.8584119Z T=4096, 2025-05-07T20:33:08.8584331Z D=5120, 2025-05-07T20:33:08.8584550Z scale_ub=None, 2025-05-07T20:33:08.8584788Z contiguous=False, 2025-05-07T20:33:08.8585042Z compiled=False, 2025-05-07T20:33:08.8585273Z ) 2025-05-07T20:33:10.4835628Z self = 2025-05-07T20:33:10.4836386Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:10.4836811Z 2025-05-07T20:33:10.4836941Z @given( 2025-05-07T20:33:10.4837307Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.4837766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.4838208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.4838673Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.4839030Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.4839360Z ) 2025-05-07T20:33:10.4839746Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.4840363Z def test_silu_mul_quant( 2025-05-07T20:33:10.4840641Z self, 2025-05-07T20:33:10.4840868Z T: int, 2025-05-07T20:33:10.4841091Z D: int, 2025-05-07T20:33:10.4841328Z scale_ub: Optional[float], 2025-05-07T20:33:10.4841628Z contiguous: bool, 2025-05-07T20:33:10.4841898Z compiled: bool, 2025-05-07T20:33:10.4842142Z ) -> None: 2025-05-07T20:33:10.4842388Z torch.manual_seed(2025) 2025-05-07T20:33:10.4842667Z 2025-05-07T20:33:10.4842963Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.4843335Z 2025-05-07T20:33:10.4843551Z x_sign = torch.sign(x) 2025-05-07T20:33:10.4843865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.4844201Z x = x_sign * x_clamp 2025-05-07T20:33:10.4844530Z x0 = x[:, :D] 2025-05-07T20:33:10.4844760Z x1 = x[:, D:] 2025-05-07T20:33:10.4844991Z 2025-05-07T20:33:10.4845194Z if contiguous: 2025-05-07T20:33:10.4845443Z x0 = x0.contiguous() 2025-05-07T20:33:10.4845726Z x1 = x1.contiguous() 2025-05-07T20:33:10.4845987Z 2025-05-07T20:33:10.4846191Z if scale_ub is not None: 2025-05-07T20:33:10.4846488Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.4846973Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.4847308Z ) 2025-05-07T20:33:10.4847513Z else: 2025-05-07T20:33:10.4847741Z scale_ub_tensor = None 2025-05-07T20:33:10.4848012Z 2025-05-07T20:33:10.4848260Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.4848595Z op = silu_mul_quant 2025-05-07T20:33:10.4848862Z if compiled: 
2025-05-07T20:33:10.4849145Z op = torch.compile(op) 2025-05-07T20:33:10.4849473Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.4849763Z 2025-05-07T20:33:10.4849973Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.4850150Z 2025-05-07T20:33:10.4850266Z moe/activation_test.py:117: 2025-05-07T20:33:10.4850585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.4850936Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.4857520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.4858268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.4858991Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.4859566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.4860288Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.4861036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.4861594Z kernel = self.compile( 2025-05-07T20:33:10.4862175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.4862866Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.4863284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.4863532Z 2025-05-07T20:33:10.4863754Z self = 2025-05-07T20:33:10.4864880Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.4866322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb62c5630>} 2025-05-07T20:33:10.4867804Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.4868870Z context = 2025-05-07T20:33:10.4869180Z 2025-05-07T20:33:10.4869359Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.4869906Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.4870399Z module_map=module_map) 2025-05-07T20:33:10.4870843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.4871227Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.4871502Z E ^ 2025-05-07T20:33:10.4872064Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.4873201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then raised the same CompilationError (ValueError: type fp8e4nv not supported in this architecture; the supported fp8 dtypes are 'fp8e4b15' and 'fp8e5'), with an identical traceback and test listing, for each of the following examples:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)   -> CompilationError in _fbgemm_silu_mul_quant (silu_mul_quant)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)     -> CompilationError in _kernel_quantize_fp8_row (ref_fn)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)    -> CompilationError in _fbgemm_silu_mul_quant (silu_mul_quant)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)  -> CompilationError in _fbgemm_silu_mul_quant (silu_mul_quant)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)        -> CompilationError in _kernel_quantize_fp8_row (ref_fn)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)     -> CompilationError in _kernel_quantize_fp8_row (ref_fn)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)      -> CompilationError in _kernel_quantize_fp8_row (ref_fn)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)     -> CompilationError in _kernel_quantize_fp8_row (ref_fn)
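Every failure above has the same root cause: Triton lowers the FP8 E4M3 dtype ("fp8e4nv") only on NVIDIA GPUs with compute capability 8.9 or newer, while older architectures expose just fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability guard that would skip these cases on unsupported hardware (the helper name and the skip placement are assumptions for illustration, not part of the FBGEMM test suite):

    import unittest
    import torch

    def gpu_supports_fp8_e4m3() -> bool:
        # fp8e4nv (FP8 E4M3) requires SM 8.9+ (Ada- or Hopper-class GPUs);
        # earlier GPUs only support fp8e4b15 / fp8e5, per the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @unittest.skipUnless(gpu_supports_fp8_e4m3(), "FP8 E4M3 unsupported on this GPU")
    # def test_silu_mul_quant(self, ...): ...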
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.8634730Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90fa09d0>} 2025-05-07T20:33:13.8636258Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.8637425Z context = 2025-05-07T20:33:13.8637758Z 2025-05-07T20:33:13.8637952Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.8638649Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.8639195Z module_map=module_map) 2025-05-07T20:33:13.8639617Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.8640035Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:13.8640351Z E ^ 2025-05-07T20:33:13.8640891Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.8641438Z 2025-05-07T20:33:13.8642031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.8642762Z 2025-05-07T20:33:13.8642915Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.8643474Z self=, 2025-05-07T20:33:13.8644000Z T=16384, 2025-05-07T20:33:13.8644232Z D=5120, 2025-05-07T20:33:13.8644462Z scale_ub=None, 2025-05-07T20:33:13.8644718Z contiguous=True, 2025-05-07T20:33:13.8644982Z compiled=True, 2025-05-07T20:33:13.8645226Z ) 2025-05-07T20:33:13.9066831Z W0507 20:33:13.905000 88023 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:33:13.9068363Z W0507 20:33:13.905000 88023 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:33:13.9069825Z W0507 20:33:13.905000 88023 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:33:13.9070910Z W0507 20:33:13.905000 88023 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:33:13.9072124Z W0507 20:33:13.905000 88023 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
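The repeated CompilationError above comes from Triton rejecting the fp8e4nv (FP8 E4M3) element type during ast_to_ttir: Triton only lowers fp8e4nv conversions on GPUs with compute capability 8.9 or newer (Ada/Hopper), and this job appears to be running on an older SM 8.6-class device (such as an A10G), where only fp8e4b15 and fp8e5 are exposed — exactly what the ValueError reports. The stride-mismatch recompiles in the warning just above are a separate, benign symptom: the Hypothesis sweep over T, D, and contiguous changes tensor strides between examples, so torch.compile keeps re-guarding silu_mul_quant until it hits config.recompile_limit (8). Below is a minimal sketch of how such tests could be gated on hardware support; supports_fp8e4nv is a hypothetical helper written for illustration, not part of the FBGEMM test suite.

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        """Best-effort check for FP8 E4M3 (fp8e4nv) support in Triton kernels.

        Triton only supports fp8e4nv on compute capability >= 8.9
        (Ada/Hopper); older GPUs expose only fp8e4b15/fp8e5, which is what
        the ValueError in this log is reporting.
        """
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Hypothetical usage on a test like test_silu_mul_quant:
    #
    #   @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
    #   def test_silu_mul_quant(self, ...): ...

For the recompile-limit warning, running with TORCH_LOGS="recompiles" (as the warning itself suggests) lists every guard failure; alternatively, marking the token dimension dynamic before compiling, e.g. torch._dynamo.mark_dynamic(x0, 0), avoids one recompile per distinct T/stride combination in the sweep.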
2025-05-07T20:33:14.0188827Z self = 2025-05-07T20:33:14.0189881Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:14.0190429Z 2025-05-07T20:33:14.0190589Z @given( 2025-05-07T20:33:14.0191057Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.0191680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.0192095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.0192476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.0192851Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.0193179Z ) 2025-05-07T20:33:14.0193632Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.0194140Z def test_silu_mul_quant( 2025-05-07T20:33:14.0194424Z self, 2025-05-07T20:33:14.0194651Z T: int, 2025-05-07T20:33:14.0194881Z D: int, 2025-05-07T20:33:14.0195133Z scale_ub: Optional[float], 2025-05-07T20:33:14.0195441Z contiguous: bool, 2025-05-07T20:33:14.0195720Z compiled: bool, 2025-05-07T20:33:14.0195983Z ) -> None: 2025-05-07T20:33:14.0196231Z torch.manual_seed(2025) 2025-05-07T20:33:14.0196511Z 2025-05-07T20:33:14.0196832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.0197220Z 2025-05-07T20:33:14.0197444Z x_sign = torch.sign(x) 2025-05-07T20:33:14.0197782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.0198133Z x = x_sign * x_clamp 2025-05-07T20:33:14.0198416Z x0 = x[:, :D] 2025-05-07T20:33:14.0198670Z x1 = x[:, D:] 2025-05-07T20:33:14.0198916Z 2025-05-07T20:33:14.0199126Z if contiguous: 2025-05-07T20:33:14.0199398Z x0 = x0.contiguous() 2025-05-07T20:33:14.0199704Z x1 = x1.contiguous() 2025-05-07T20:33:14.0200077Z 2025-05-07T20:33:14.0200310Z if scale_ub is not None: 2025-05-07T20:33:14.0200631Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.0201010Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.0201371Z ) 2025-05-07T20:33:14.0201610Z else: 2025-05-07T20:33:14.0201853Z scale_ub_tensor = None 2025-05-07T20:33:14.0202143Z 2025-05-07T20:33:14.0202414Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.0202771Z op = silu_mul_quant 2025-05-07T20:33:14.0203061Z if compiled: 2025-05-07T20:33:14.0203353Z op = torch.compile(op) 2025-05-07T20:33:14.0203689Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.0204075Z 2025-05-07T20:33:14.0204299Z y_fp8, y_scale = fn() 2025-05-07T20:33:14.0204626Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:14.0204956Z 2025-05-07T20:33:14.0205232Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.0205614Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:14.0205945Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:14.0206376Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:14.0206866Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.0207220Z 2025-05-07T20:33:14.0207457Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:14.0207681Z 2025-05-07T20:33:14.0207803Z moe/activation_test.py:126: 2025-05-07T20:33:14.0208138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.0208522Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:14.0208898Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.0209800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:14.0210647Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:14.0211269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.0212048Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.0212830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:14.0213646Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:14.0214502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:14.0215350Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:14.0216190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:14.0216919Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:14.0217606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:14.0218198Z fn() 2025-05-07T20:33:14.0218772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:14.0219434Z self.fn.run( 2025-05-07T20:33:14.0219969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.0220573Z kernel = self.compile( 2025-05-07T20:33:14.0221180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.0221926Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.0222381Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.0222642Z 2025-05-07T20:33:14.0222932Z self = 2025-05-07T20:33:14.0224330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.0225894Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90fa2170>} 2025-05-07T20:33:14.0227416Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.0228650Z context = 2025-05-07T20:33:14.0228978Z 2025-05-07T20:33:14.0229176Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.0229776Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.0230316Z module_map=module_map) 2025-05-07T20:33:14.0230803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.0231271Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:14.0231585Z E ^ 2025-05-07T20:33:14.0232117Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.0232622Z 2025-05-07T20:33:14.0233090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.0233739Z 2025-05-07T20:33:14.0233864Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.0234334Z self=, 2025-05-07T20:33:14.0234792Z T=1, 2025-05-07T20:33:14.0235004Z D=5120, 2025-05-07T20:33:14.0235234Z scale_ub=1200.0, 2025-05-07T20:33:14.0235492Z contiguous=True, 2025-05-07T20:33:14.0235744Z compiled=True, 2025-05-07T20:33:14.0235980Z ) 2025-05-07T20:33:14.1823307Z self = 2025-05-07T20:33:14.1824619Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:14.1825145Z 2025-05-07T20:33:14.1825310Z @given( 2025-05-07T20:33:14.1825782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.1826412Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.1827019Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.1827684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.1828346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.1828924Z ) 2025-05-07T20:33:14.1829620Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.1830508Z def test_silu_mul_quant( 2025-05-07T20:33:14.1830998Z self, 2025-05-07T20:33:14.1831388Z T: int, 2025-05-07T20:33:14.1831791Z D: int, 2025-05-07T20:33:14.1832125Z scale_ub: Optional[float], 2025-05-07T20:33:14.1832449Z contiguous: bool, 2025-05-07T20:33:14.1832732Z compiled: bool, 2025-05-07T20:33:14.1832993Z ) -> None: 2025-05-07T20:33:14.1833242Z torch.manual_seed(2025) 2025-05-07T20:33:14.1833584Z 2025-05-07T20:33:14.1833903Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.1834290Z 2025-05-07T20:33:14.1834521Z x_sign = torch.sign(x) 2025-05-07T20:33:14.1834862Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.1835222Z x = x_sign * x_clamp 2025-05-07T20:33:14.1835501Z x0 = x[:, :D] 2025-05-07T20:33:14.1835759Z x1 = x[:, D:] 2025-05-07T20:33:14.1836004Z 2025-05-07T20:33:14.1836219Z if contiguous: 2025-05-07T20:33:14.1836658Z x0 = x0.contiguous() 2025-05-07T20:33:14.1836964Z x1 = x1.contiguous() 2025-05-07T20:33:14.1837238Z 2025-05-07T20:33:14.1837462Z if scale_ub is not None: 2025-05-07T20:33:14.1837781Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.1838162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.1838515Z ) 2025-05-07T20:33:14.1838735Z else: 2025-05-07T20:33:14.1838988Z scale_ub_tensor = None 2025-05-07T20:33:14.1839281Z 2025-05-07T20:33:14.1839543Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.1839905Z op = silu_mul_quant 2025-05-07T20:33:14.1840198Z if compiled: 2025-05-07T20:33:14.1840487Z op = torch.compile(op) 2025-05-07T20:33:14.1840901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.1841217Z 2025-05-07T20:33:14.1841445Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.1841633Z 2025-05-07T20:33:14.1841754Z moe/activation_test.py:117: 2025-05-07T20:33:14.1842095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.1842474Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.1842868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.1843561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:14.1844197Z return fn(*args, **kwargs) 
2025-05-07T20:33:14.1844949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.1845720Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.1846327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.1847101Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.1847843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.1848443Z kernel = self.compile( 2025-05-07T20:33:14.1849057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.1849803Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.1850248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.1850510Z 2025-05-07T20:33:14.1850744Z self = 2025-05-07T20:33:14.1851984Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.1853566Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90f9a050>} 2025-05-07T20:33:14.1855069Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.1856219Z context = 2025-05-07T20:33:14.1856550Z 2025-05-07T20:33:14.1856741Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.1857337Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.1857860Z module_map=module_map) 2025-05-07T20:33:14.1858275Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.1858683Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.1858978Z E ^ 2025-05-07T20:33:14.1859555Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.1860066Z 2025-05-07T20:33:14.1860534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.1861108Z 2025-05-07T20:33:14.1861235Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.1861705Z self=, 2025-05-07T20:33:14.1862169Z T=1, 2025-05-07T20:33:14.1862415Z D=5120, 2025-05-07T20:33:14.1862637Z scale_ub=None, 2025-05-07T20:33:14.1862882Z contiguous=False, 2025-05-07T20:33:14.1863143Z compiled=True, 2025-05-07T20:33:14.1863376Z ) 2025-05-07T20:33:14.2603909Z self = 2025-05-07T20:33:14.2604605Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:14.2604906Z 2025-05-07T20:33:14.2604997Z @given( 2025-05-07T20:33:14.2605266Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.2605613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.2605960Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.2606411Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.2606846Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.2607169Z ) 2025-05-07T20:33:14.2607568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.2608065Z def test_silu_mul_quant( 2025-05-07T20:33:14.2608335Z self, 2025-05-07T20:33:14.2608557Z T: int, 2025-05-07T20:33:14.2608785Z D: int, 2025-05-07T20:33:14.2609025Z scale_ub: Optional[float], 2025-05-07T20:33:14.2609333Z contiguous: bool, 2025-05-07T20:33:14.2609611Z compiled: bool, 2025-05-07T20:33:14.2609859Z ) -> None: 2025-05-07T20:33:14.2610106Z torch.manual_seed(2025) 2025-05-07T20:33:14.2610378Z 2025-05-07T20:33:14.2610686Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.2611066Z 2025-05-07T20:33:14.2611286Z x_sign = torch.sign(x) 2025-05-07T20:33:14.2611613Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.2611969Z x = x_sign * x_clamp 2025-05-07T20:33:14.2612242Z x0 = x[:, :D] 2025-05-07T20:33:14.2612486Z x1 = x[:, D:] 2025-05-07T20:33:14.2612719Z 2025-05-07T20:33:14.2612932Z if contiguous: 2025-05-07T20:33:14.2613196Z x0 = x0.contiguous() 2025-05-07T20:33:14.2613485Z x1 = x1.contiguous() 2025-05-07T20:33:14.2613762Z 2025-05-07T20:33:14.2613979Z if scale_ub is not None: 2025-05-07T20:33:14.2614287Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.2614668Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.2615019Z ) 2025-05-07T20:33:14.2615237Z else: 2025-05-07T20:33:14.2615479Z scale_ub_tensor = None 2025-05-07T20:33:14.2615762Z 2025-05-07T20:33:14.2616019Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.2616369Z op = silu_mul_quant 2025-05-07T20:33:14.2616656Z if compiled: 2025-05-07T20:33:14.2616935Z op = torch.compile(op) 2025-05-07T20:33:14.2617271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.2617581Z 2025-05-07T20:33:14.2617795Z y_fp8, y_scale = fn() 2025-05-07T20:33:14.2618117Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:14.2618442Z 2025-05-07T20:33:14.2618711Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.2619083Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:14.2619418Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:14.2619770Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:14.2620244Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.2620600Z 2025-05-07T20:33:14.2620833Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:14.2621051Z 2025-05-07T20:33:14.2621163Z moe/activation_test.py:126: 2025-05-07T20:33:14.2621508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.2621886Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:14.2622297Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.2623183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:14.2624185Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:14.2624800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.2625633Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.2626406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:14.2627216Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:14.2628188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:14.2629024Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:14.2629838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:14.2630551Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:14.2631231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:14.2631819Z fn() 2025-05-07T20:33:14.2632433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:14.2633086Z self.fn.run( 2025-05-07T20:33:14.2633709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.2634311Z kernel = self.compile( 2025-05-07T20:33:14.2634917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.2635650Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.2636089Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.2636350Z 2025-05-07T20:33:14.2636582Z self = 2025-05-07T20:33:14.2637788Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.2639333Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd91499900>} 2025-05-07T20:33:14.2640835Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.2641984Z context = 2025-05-07T20:33:14.2642313Z 2025-05-07T20:33:14.2642499Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.2643085Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.2643609Z module_map=module_map) 2025-05-07T20:33:14.2644018Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.2644496Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:14.2644801Z E ^ 2025-05-07T20:33:14.2645319Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.2645830Z 2025-05-07T20:33:14.2646300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.2646876Z 2025-05-07T20:33:14.2647007Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.2647476Z self=, 2025-05-07T20:33:14.2647922Z T=1, 2025-05-07T20:33:14.2648131Z D=5120, 2025-05-07T20:33:14.2648351Z scale_ub=None, 2025-05-07T20:33:14.2648590Z contiguous=True, 2025-05-07T20:33:14.2648844Z compiled=False, 2025-05-07T20:33:14.2649132Z ) 2025-05-07T20:33:14.6098394Z self = 2025-05-07T20:33:14.6099463Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:14.6099977Z 2025-05-07T20:33:14.6100136Z @given( 2025-05-07T20:33:14.6100594Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.6101193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.6102145Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.6102620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.6103000Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.6103327Z ) 2025-05-07T20:33:14.6103732Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.6104239Z def test_silu_mul_quant( 2025-05-07T20:33:14.6104520Z self, 2025-05-07T20:33:14.6104752Z T: int, 2025-05-07T20:33:14.6104990Z D: int, 2025-05-07T20:33:14.6105241Z scale_ub: Optional[float], 2025-05-07T20:33:14.6105554Z contiguous: bool, 2025-05-07T20:33:14.6105834Z compiled: bool, 2025-05-07T20:33:14.6106095Z ) -> None: 2025-05-07T20:33:14.6106349Z torch.manual_seed(2025) 2025-05-07T20:33:14.6106631Z 2025-05-07T20:33:14.6106943Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.6107341Z 2025-05-07T20:33:14.6107569Z x_sign = torch.sign(x) 2025-05-07T20:33:14.6107908Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.6108273Z x = x_sign * x_clamp 2025-05-07T20:33:14.6108552Z x0 = x[:, :D] 2025-05-07T20:33:14.6108808Z x1 = x[:, D:] 2025-05-07T20:33:14.6109045Z 2025-05-07T20:33:14.6109265Z if contiguous: 2025-05-07T20:33:14.6109535Z x0 = x0.contiguous() 2025-05-07T20:33:14.6109832Z x1 = x1.contiguous() 2025-05-07T20:33:14.6110114Z 2025-05-07T20:33:14.6110345Z if scale_ub is not None: 2025-05-07T20:33:14.6110667Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.6111051Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.6111405Z ) 2025-05-07T20:33:14.6111633Z else: 2025-05-07T20:33:14.6111882Z scale_ub_tensor = None 2025-05-07T20:33:14.6112165Z 2025-05-07T20:33:14.6112440Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.6119492Z op = silu_mul_quant 2025-05-07T20:33:14.6119813Z if compiled: 2025-05-07T20:33:14.6120102Z 
op = torch.compile(op) 2025-05-07T20:33:14.6120450Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.6120773Z 2025-05-07T20:33:14.6120995Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.6121193Z 2025-05-07T20:33:14.6121310Z moe/activation_test.py:117: 2025-05-07T20:33:14.6121653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.6122045Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.6122366Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.6123279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.6124267Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.6124880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.6125668Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.6126430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.6127048Z kernel = self.compile( 2025-05-07T20:33:14.6127667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.6128420Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.6128967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.6129238Z 2025-05-07T20:33:14.6129479Z self = 2025-05-07T20:33:14.6130782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.6132428Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd914988b0>} 2025-05-07T20:33:14.6133964Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.6135136Z context = 2025-05-07T20:33:14.6135465Z 2025-05-07T20:33:14.6135657Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.6136260Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.6136798Z module_map=module_map) 2025-05-07T20:33:14.6137221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.6137628Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.6137928Z E ^ 2025-05-07T20:33:14.6138463Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.6138976Z 2025-05-07T20:33:14.6139450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.6140038Z 2025-05-07T20:33:14.6140159Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.6140633Z self=, 2025-05-07T20:33:14.6141101Z T=128, 2025-05-07T20:33:14.6141319Z D=5120, 2025-05-07T20:33:14.6141547Z scale_ub=None, 2025-05-07T20:33:14.6141805Z contiguous=False, 2025-05-07T20:33:14.6142064Z compiled=True, 2025-05-07T20:33:14.6142302Z ) 2025-05-07T20:33:14.6142672Z self = 2025-05-07T20:33:14.6143242Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:14.6143557Z 2025-05-07T20:33:14.6143646Z @given( 2025-05-07T20:33:14.6143911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.6144269Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.6144621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.6145002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.6145385Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.6145718Z ) 2025-05-07T20:33:14.6146124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.6146706Z def test_silu_mul_quant( 2025-05-07T20:33:14.6146988Z self, 2025-05-07T20:33:14.6147214Z T: int, 2025-05-07T20:33:14.6147444Z D: int, 2025-05-07T20:33:14.6147690Z scale_ub: Optional[float], 2025-05-07T20:33:14.6148008Z contiguous: bool, 2025-05-07T20:33:14.6148292Z compiled: bool, 2025-05-07T20:33:14.6148554Z ) -> None: 2025-05-07T20:33:14.6148810Z torch.manual_seed(2025) 2025-05-07T20:33:14.6149093Z 2025-05-07T20:33:14.6149405Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.6149800Z 2025-05-07T20:33:14.6150028Z x_sign = torch.sign(x) 2025-05-07T20:33:14.6150371Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.6150727Z x = x_sign * x_clamp 2025-05-07T20:33:14.6151062Z x0 = x[:, :D] 2025-05-07T20:33:14.6151317Z x1 = x[:, D:] 2025-05-07T20:33:14.6151556Z 2025-05-07T20:33:14.6151775Z if contiguous: 2025-05-07T20:33:14.6152049Z x0 = x0.contiguous() 2025-05-07T20:33:14.6152392Z x1 = x1.contiguous() 2025-05-07T20:33:14.6152670Z 2025-05-07T20:33:14.6152897Z if scale_ub is not None: 2025-05-07T20:33:14.6153212Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.6153821Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.6154181Z ) 2025-05-07T20:33:14.6154400Z else: 2025-05-07T20:33:14.6154646Z scale_ub_tensor = None 2025-05-07T20:33:14.6154937Z 2025-05-07T20:33:14.6155203Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.6155570Z op = silu_mul_quant 2025-05-07T20:33:14.6155860Z if compiled: 2025-05-07T20:33:14.6156149Z op = torch.compile(op) 2025-05-07T20:33:14.6156494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.6156812Z 2025-05-07T20:33:14.6157039Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.6157231Z 2025-05-07T20:33:14.6157348Z moe/activation_test.py:117: 2025-05-07T20:33:14.6157686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.6158067Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.6158389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.6159029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:14.6159667Z return fn(*args, **kwargs) 
2025-05-07T20:33:14.6160419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.6161216Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.6161835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.6162664Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.6163418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.6164022Z kernel = self.compile( 2025-05-07T20:33:14.6164640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.6165392Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.6165844Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.6166110Z 2025-05-07T20:33:14.6166346Z self = 2025-05-07T20:33:14.6167567Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.6169170Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9149b880>} 2025-05-07T20:33:14.6170686Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.6171848Z context = 2025-05-07T20:33:14.6172219Z 2025-05-07T20:33:14.6172444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.6173048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.6173594Z module_map=module_map) 2025-05-07T20:33:14.6174014Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.6174471Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.6174781Z E ^ 2025-05-07T20:33:14.6175312Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.6175830Z 2025-05-07T20:33:14.6176303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.6176941Z 2025-05-07T20:33:14.6177107Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.6177589Z self=, 2025-05-07T20:33:14.6178045Z T=128, 2025-05-07T20:33:14.6178272Z D=7168, 2025-05-07T20:33:14.6178504Z scale_ub=1200.0, 2025-05-07T20:33:14.6178768Z contiguous=False, 2025-05-07T20:33:14.6179033Z compiled=False, 2025-05-07T20:33:14.6179278Z ) 2025-05-07T20:33:14.7560934Z self = 2025-05-07T20:33:14.7562003Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:14.7562322Z 2025-05-07T20:33:14.7562424Z @given( 2025-05-07T20:33:14.7562691Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.7563046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.7563398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.7563775Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.7564152Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.7564478Z ) 2025-05-07T20:33:14.7564871Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.7565369Z def test_silu_mul_quant( 2025-05-07T20:33:14.7565647Z self, 2025-05-07T20:33:14.7565869Z T: int, 2025-05-07T20:33:14.7566096Z D: int, 2025-05-07T20:33:14.7566347Z scale_ub: Optional[float], 2025-05-07T20:33:14.7566654Z contiguous: bool, 2025-05-07T20:33:14.7566935Z compiled: bool, 2025-05-07T20:33:14.7567202Z ) -> None: 2025-05-07T20:33:14.7567454Z torch.manual_seed(2025) 2025-05-07T20:33:14.7567727Z 2025-05-07T20:33:14.7568041Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.7568426Z 2025-05-07T20:33:14.7568649Z x_sign = torch.sign(x) 2025-05-07T20:33:14.7568978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.7569341Z x = x_sign * x_clamp 2025-05-07T20:33:14.7569613Z x0 = x[:, :D] 2025-05-07T20:33:14.7569865Z x1 = x[:, D:] 2025-05-07T20:33:14.7570105Z 2025-05-07T20:33:14.7570316Z if contiguous: 2025-05-07T20:33:14.7570585Z x0 = x0.contiguous() 2025-05-07T20:33:14.7570884Z x1 = x1.contiguous() 2025-05-07T20:33:14.7571153Z 2025-05-07T20:33:14.7571379Z if scale_ub is not None: 2025-05-07T20:33:14.7571692Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.7572069Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.7572423Z ) 2025-05-07T20:33:14.7572650Z else: 2025-05-07T20:33:14.7573007Z scale_ub_tensor = None 2025-05-07T20:33:14.7573293Z 2025-05-07T20:33:14.7573559Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.7573917Z op = silu_mul_quant 2025-05-07T20:33:14.7574201Z if compiled: 2025-05-07T20:33:14.7574483Z op = torch.compile(op) 2025-05-07T20:33:14.7574821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.7575131Z 2025-05-07T20:33:14.7575351Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.7575539Z 2025-05-07T20:33:14.7575657Z moe/activation_test.py:117: 2025-05-07T20:33:14.7575986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.7576361Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.7576679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.7577532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.7578309Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.7578913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.7579679Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.7580585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.7581183Z kernel = self.compile( 2025-05-07T20:33:14.7581793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.7582534Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.7582984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.7583246Z 2025-05-07T20:33:14.7583480Z self = 2025-05-07T20:33:14.7584689Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.7586232Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd91499090>} 2025-05-07T20:33:14.7587732Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.7588878Z context = 2025-05-07T20:33:14.7589203Z 2025-05-07T20:33:14.7589397Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.7589981Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.7590511Z module_map=module_map) 2025-05-07T20:33:14.7590919Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.7591316Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.7591613Z E ^ 2025-05-07T20:33:14.7592141Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.7592698Z 2025-05-07T20:33:14.7593163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.7593807Z 2025-05-07T20:33:14.7593927Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.7594399Z self=, 2025-05-07T20:33:14.7594849Z T=128, 2025-05-07T20:33:14.7595065Z D=5120, 2025-05-07T20:33:14.7595289Z scale_ub=None, 2025-05-07T20:33:14.7595536Z contiguous=False, 2025-05-07T20:33:14.7595792Z compiled=False, 2025-05-07T20:33:14.7596105Z ) 2025-05-07T20:33:14.7596469Z self = 2025-05-07T20:33:14.7597021Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:14.7597336Z 2025-05-07T20:33:14.7597428Z @given( 2025-05-07T20:33:14.7597696Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.7598046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.7598399Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.7598776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.7599148Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.7599473Z ) 2025-05-07T20:33:14.7599872Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.7600418Z def test_silu_mul_quant( 2025-05-07T20:33:14.7600692Z self, 2025-05-07T20:33:14.7600919Z T: int, 2025-05-07T20:33:14.7601153Z D: int, 2025-05-07T20:33:14.7601401Z scale_ub: Optional[float], 2025-05-07T20:33:14.7601713Z contiguous: bool, 2025-05-07T20:33:14.7602012Z compiled: bool, 2025-05-07T20:33:14.7602295Z ) -> None: 2025-05-07T20:33:14.7602593Z torch.manual_seed(2025) 2025-05-07T20:33:14.7602912Z 2025-05-07T20:33:14.7603231Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.7603631Z 2025-05-07T20:33:14.7603854Z x_sign = torch.sign(x) 2025-05-07T20:33:14.7604182Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.7604534Z x = x_sign * x_clamp 2025-05-07T20:33:14.7604811Z x0 = x[:, :D] 2025-05-07T20:33:14.7605061Z x1 = x[:, D:] 2025-05-07T20:33:14.7605301Z 2025-05-07T20:33:14.7605526Z if contiguous: 2025-05-07T20:33:14.7605786Z x0 = x0.contiguous() 2025-05-07T20:33:14.7606081Z x1 = x1.contiguous() 2025-05-07T20:33:14.7606357Z 2025-05-07T20:33:14.7606582Z if scale_ub is not None: 2025-05-07T20:33:14.7606891Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.7607269Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.7607625Z ) 2025-05-07T20:33:14.7607843Z else: 2025-05-07T20:33:14.7608088Z scale_ub_tensor = None 2025-05-07T20:33:14.7608374Z 2025-05-07T20:33:14.7608636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.7608995Z op = silu_mul_quant 2025-05-07T20:33:14.7609278Z if compiled: 2025-05-07T20:33:14.7609555Z op = torch.compile(op) 2025-05-07T20:33:14.7609889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.7610203Z 2025-05-07T20:33:14.7610421Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.7610617Z 2025-05-07T20:33:14.7610730Z moe/activation_test.py:117: 2025-05-07T20:33:14.7611075Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.7611454Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.7611780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.7612748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.7613729Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.7614459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.7615226Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.7615972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.7616578Z kernel = self.compile( 2025-05-07T20:33:14.7617185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.7617978Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.7618425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.7618684Z 2025-05-07T20:33:14.7618924Z self = 2025-05-07T20:33:14.7620130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.7621693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd913a3eb0>} 2025-05-07T20:33:14.7623570Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.7624940Z context = 2025-05-07T20:33:14.7625264Z 2025-05-07T20:33:14.7625454Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.7626176Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.7626707Z module_map=module_map) 2025-05-07T20:33:14.7627123Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.7627520Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.7627817Z E ^ 2025-05-07T20:33:14.7628344Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.7628849Z 2025-05-07T20:33:14.7629314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.7629893Z 2025-05-07T20:33:14.7630011Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.7630483Z self=, 2025-05-07T20:33:14.7630938Z T=128, 2025-05-07T20:33:14.7631150Z D=5120, 2025-05-07T20:33:14.7631380Z scale_ub=1200.0, 2025-05-07T20:33:14.7631639Z contiguous=True, 2025-05-07T20:33:14.7631892Z compiled=False, 2025-05-07T20:33:14.7632154Z ) 2025-05-07T20:33:14.9752883Z self = 2025-05-07T20:33:14.9753486Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:14.9753868Z 2025-05-07T20:33:14.9754005Z @given( 2025-05-07T20:33:14.9754383Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.9754753Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.9755101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.9755468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.9755837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.9756156Z ) 2025-05-07T20:33:14.9756545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.9757038Z def test_silu_mul_quant( 2025-05-07T20:33:14.9757308Z self, 2025-05-07T20:33:14.9757526Z T: int, 2025-05-07T20:33:14.9757747Z D: int, 2025-05-07T20:33:14.9757990Z scale_ub: Optional[float], 2025-05-07T20:33:14.9758296Z contiguous: bool, 2025-05-07T20:33:14.9758564Z compiled: bool, 2025-05-07T20:33:14.9758816Z ) -> None: 2025-05-07T20:33:14.9759057Z torch.manual_seed(2025) 2025-05-07T20:33:14.9759329Z 2025-05-07T20:33:14.9759635Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.9760009Z 2025-05-07T20:33:14.9760229Z x_sign = torch.sign(x) 2025-05-07T20:33:14.9760559Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.9760900Z x = x_sign * x_clamp 2025-05-07T20:33:14.9761297Z x0 = x[:, :D] 2025-05-07T20:33:14.9761550Z x1 = x[:, D:] 2025-05-07T20:33:14.9761778Z 2025-05-07T20:33:14.9761992Z if contiguous: 2025-05-07T20:33:14.9762288Z x0 = x0.contiguous() 2025-05-07T20:33:14.9762590Z x1 = x1.contiguous() 2025-05-07T20:33:14.9762858Z 2025-05-07T20:33:14.9763076Z if scale_ub is not None: 2025-05-07T20:33:14.9763379Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.9763753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.9764100Z ) 2025-05-07T20:33:14.9764318Z else: 2025-05-07T20:33:14.9764552Z scale_ub_tensor = None 2025-05-07T20:33:14.9764838Z 2025-05-07T20:33:14.9765094Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.9765531Z op = silu_mul_quant 2025-05-07T20:33:14.9765820Z if compiled: 2025-05-07T20:33:14.9766107Z op = torch.compile(op) 2025-05-07T20:33:14.9766469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.9766775Z 2025-05-07T20:33:14.9766984Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.9767173Z 2025-05-07T20:33:14.9767284Z moe/activation_test.py:117: 2025-05-07T20:33:14.9767746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.9768112Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.9768426Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.9769192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.9769951Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.9770549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.9771310Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.9772046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.9772634Z kernel = self.compile( 2025-05-07T20:33:14.9773237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.9773969Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.9774410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.9774663Z 2025-05-07T20:33:14.9774890Z self = 2025-05-07T20:33:14.9776083Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.9777611Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd913a2d40>} 2025-05-07T20:33:14.9779102Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.9780238Z context = 2025-05-07T20:33:14.9780563Z 2025-05-07T20:33:14.9780748Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.9781323Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.9781842Z module_map=module_map) 2025-05-07T20:33:14.9782245Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.9782695Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.9782985Z E ^ 2025-05-07T20:33:14.9783546Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.9784049Z 2025-05-07T20:33:14.9784509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.9785082Z 2025-05-07T20:33:14.9785203Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.9785665Z self=, 2025-05-07T20:33:14.9786112Z T=1, 2025-05-07T20:33:14.9786314Z D=7168, 2025-05-07T20:33:14.9786532Z scale_ub=1200.0, 2025-05-07T20:33:14.9786776Z contiguous=True, 2025-05-07T20:33:14.9787023Z compiled=True, 2025-05-07T20:33:14.9787244Z ) 2025-05-07T20:33:14.9787604Z self = 2025-05-07T20:33:14.9788224Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:14.9788513Z 2025-05-07T20:33:14.9788602Z @given( 2025-05-07T20:33:14.9788997Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.9795687Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.9796045Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.9796509Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.9796921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.9797251Z ) 2025-05-07T20:33:14.9797648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.9798140Z def test_silu_mul_quant( 2025-05-07T20:33:14.9798413Z self, 2025-05-07T20:33:14.9798635Z T: int, 2025-05-07T20:33:14.9798856Z D: int, 2025-05-07T20:33:14.9799097Z scale_ub: Optional[float], 2025-05-07T20:33:14.9799406Z contiguous: bool, 2025-05-07T20:33:14.9799679Z compiled: bool, 2025-05-07T20:33:14.9799928Z ) -> None: 2025-05-07T20:33:14.9800171Z torch.manual_seed(2025) 2025-05-07T20:33:14.9800445Z 2025-05-07T20:33:14.9800754Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.9801136Z 2025-05-07T20:33:14.9801355Z x_sign = torch.sign(x) 2025-05-07T20:33:14.9801675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.9802030Z x = x_sign * x_clamp 2025-05-07T20:33:14.9802302Z x0 = x[:, :D] 2025-05-07T20:33:14.9802540Z x1 = x[:, D:] 2025-05-07T20:33:14.9802774Z 2025-05-07T20:33:14.9802983Z if contiguous: 2025-05-07T20:33:14.9803244Z x0 = x0.contiguous() 2025-05-07T20:33:14.9803531Z x1 = x1.contiguous() 2025-05-07T20:33:14.9803803Z 2025-05-07T20:33:14.9804017Z if scale_ub is not None: 2025-05-07T20:33:14.9804320Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.9804699Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.9805051Z ) 2025-05-07T20:33:14.9805263Z else: 2025-05-07T20:33:14.9805500Z scale_ub_tensor = None 2025-05-07T20:33:14.9805781Z 2025-05-07T20:33:14.9806037Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.9806388Z op = silu_mul_quant 2025-05-07T20:33:14.9806673Z if compiled: 2025-05-07T20:33:14.9806952Z op = torch.compile(op) 2025-05-07T20:33:14.9807292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.9807599Z 2025-05-07T20:33:14.9807811Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.9808002Z 2025-05-07T20:33:14.9808113Z moe/activation_test.py:117: 2025-05-07T20:33:14.9808442Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.9808809Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.9809121Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.9809754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:14.9810381Z return fn(*args, **kwargs) 
2025-05-07T20:33:14.9811165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.9811927Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.9812529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.9813288Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.9814020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.9814608Z kernel = self.compile( 2025-05-07T20:33:14.9815210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.9816005Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.9816446Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.9816712Z 2025-05-07T20:33:14.9816941Z self = 2025-05-07T20:33:14.9818178Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.9819744Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd913a1480>} 2025-05-07T20:33:14.9821234Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.9822376Z context = 2025-05-07T20:33:14.9822701Z 2025-05-07T20:33:14.9822890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.9823470Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.9824305Z module_map=module_map) 2025-05-07T20:33:14.9824720Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.9825112Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.9825397Z E ^ 2025-05-07T20:33:14.9825912Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.9826412Z 2025-05-07T20:33:14.9826873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.9827441Z 2025-05-07T20:33:14.9827562Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.9828022Z self=, 2025-05-07T20:33:14.9828471Z T=1, 2025-05-07T20:33:14.9828679Z D=7168, 2025-05-07T20:33:14.9828892Z scale_ub=1200.0, 2025-05-07T20:33:14.9829145Z contiguous=False, 2025-05-07T20:33:14.9829397Z compiled=True, 2025-05-07T20:33:14.9829625Z ) 2025-05-07T20:33:15.1327213Z self = 2025-05-07T20:33:15.1327877Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:15.1328305Z 2025-05-07T20:33:15.1328398Z @given( 2025-05-07T20:33:15.1328661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.1329002Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.1329349Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.1329718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.1330086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.1330403Z ) 2025-05-07T20:33:15.1330915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.1331412Z def test_silu_mul_quant( 2025-05-07T20:33:15.1331679Z self, 2025-05-07T20:33:15.1331901Z T: int, 2025-05-07T20:33:15.1332131Z D: int, 2025-05-07T20:33:15.1332413Z scale_ub: Optional[float], 2025-05-07T20:33:15.1332717Z contiguous: bool, 2025-05-07T20:33:15.1332988Z compiled: bool, 2025-05-07T20:33:15.1333234Z ) -> None: 2025-05-07T20:33:15.1333475Z torch.manual_seed(2025) 2025-05-07T20:33:15.1333761Z 2025-05-07T20:33:15.1334060Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.1334450Z 2025-05-07T20:33:15.1334669Z x_sign = torch.sign(x) 2025-05-07T20:33:15.1334988Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.1335405Z x = x_sign * x_clamp 2025-05-07T20:33:15.1335672Z x0 = x[:, :D] 2025-05-07T20:33:15.1335914Z x1 = x[:, D:] 2025-05-07T20:33:15.1336151Z 2025-05-07T20:33:15.1336367Z if contiguous: 2025-05-07T20:33:15.1336624Z x0 = x0.contiguous() 2025-05-07T20:33:15.1336915Z x1 = x1.contiguous() 2025-05-07T20:33:15.1337184Z 2025-05-07T20:33:15.1337395Z if scale_ub is not None: 2025-05-07T20:33:15.1337828Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.1338206Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.1338550Z ) 2025-05-07T20:33:15.1338763Z else: 2025-05-07T20:33:15.1339002Z scale_ub_tensor = None 2025-05-07T20:33:15.1339285Z 2025-05-07T20:33:15.1339543Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.1339894Z op = silu_mul_quant 2025-05-07T20:33:15.1340175Z if compiled: 2025-05-07T20:33:15.1340451Z op = torch.compile(op) 2025-05-07T20:33:15.1340783Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.1341089Z 2025-05-07T20:33:15.1341304Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.1341500Z 2025-05-07T20:33:15.1341613Z moe/activation_test.py:117: 2025-05-07T20:33:15.1341940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.1342344Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.1342700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.1343332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:15.1343957Z return fn(*args, **kwargs) 
2025-05-07T20:33:15.1344685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.1345451Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.1346046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.1346811Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.1347541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.1348131Z kernel = self.compile( 2025-05-07T20:33:15.1348739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.1349463Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.1349910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.1350168Z 2025-05-07T20:33:15.1350401Z self = 2025-05-07T20:33:15.1351600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.1353205Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd913a0940>} 2025-05-07T20:33:15.1354753Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.1355887Z context = 2025-05-07T20:33:15.1356204Z 2025-05-07T20:33:15.1356393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.1356969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.1357480Z module_map=module_map) 2025-05-07T20:33:15.1357936Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.1358326Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.1358610Z E ^ 2025-05-07T20:33:15.1359125Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.1359618Z 2025-05-07T20:33:15.1360077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.1360756Z 2025-05-07T20:33:15.1360880Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.1361334Z self=, 2025-05-07T20:33:15.1361775Z T=1, 2025-05-07T20:33:15.1361987Z D=7168, 2025-05-07T20:33:15.1362225Z scale_ub=None, 2025-05-07T20:33:15.1362494Z contiguous=False, 2025-05-07T20:33:15.1362786Z compiled=True, 2025-05-07T20:33:15.1363011Z ) 2025-05-07T20:33:15.4079488Z self = 2025-05-07T20:33:15.4080274Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:15.4080691Z 2025-05-07T20:33:15.4080827Z @given( 2025-05-07T20:33:15.4081170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.4081578Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.4081931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.4082379Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.4082813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.4083140Z ) 2025-05-07T20:33:15.4083551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.4084121Z def test_silu_mul_quant( 2025-05-07T20:33:15.4084405Z self, 2025-05-07T20:33:15.4084694Z T: int, 2025-05-07T20:33:15.4084927Z D: int, 2025-05-07T20:33:15.4085177Z scale_ub: Optional[float], 2025-05-07T20:33:15.4085497Z contiguous: bool, 2025-05-07T20:33:15.4085781Z compiled: bool, 2025-05-07T20:33:15.4086044Z ) -> None: 2025-05-07T20:33:15.4086300Z torch.manual_seed(2025) 2025-05-07T20:33:15.4086588Z 2025-05-07T20:33:15.4086903Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.4087294Z 2025-05-07T20:33:15.4087523Z x_sign = torch.sign(x) 2025-05-07T20:33:15.4087871Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.4088224Z x = x_sign * x_clamp 2025-05-07T20:33:15.4088503Z x0 = x[:, :D] 2025-05-07T20:33:15.4088755Z x1 = x[:, D:] 2025-05-07T20:33:15.4088996Z 2025-05-07T20:33:15.4089216Z if contiguous: 2025-05-07T20:33:15.4089489Z x0 = x0.contiguous() 2025-05-07T20:33:15.4089784Z x1 = x1.contiguous() 2025-05-07T20:33:15.4090067Z 2025-05-07T20:33:15.4090293Z if scale_ub is not None: 2025-05-07T20:33:15.4090610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.4090996Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.4091355Z ) 2025-05-07T20:33:15.4091719Z else: 2025-05-07T20:33:15.4091969Z scale_ub_tensor = None 2025-05-07T20:33:15.4092262Z 2025-05-07T20:33:15.4092529Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.4092894Z op = silu_mul_quant 2025-05-07T20:33:15.4093183Z if compiled: 2025-05-07T20:33:15.4093473Z op = torch.compile(op) 2025-05-07T20:33:15.4093810Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.4094130Z 2025-05-07T20:33:15.4094353Z y_fp8, y_scale = fn() 2025-05-07T20:33:15.4094679Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:15.4095012Z 2025-05-07T20:33:15.4095288Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.4095666Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:15.4096078Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:15.4096439Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:15.4096846Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.4097207Z 2025-05-07T20:33:15.4097443Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:15.4097666Z 2025-05-07T20:33:15.4097790Z moe/activation_test.py:126: 2025-05-07T20:33:15.4098258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.4098646Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:15.4099021Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.4099913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:15.4100771Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:15.4101391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.4102172Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.4102951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:15.4103772Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:15.4104632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:15.4105482Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:15.4106306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:15.4107033Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:15.4107720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:15.4108312Z fn() 2025-05-07T20:33:15.4108895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:15.4109557Z self.fn.run( 2025-05-07T20:33:15.4110094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.4110697Z kernel = self.compile( 2025-05-07T20:33:15.4111314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.4112118Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.4112675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.4113003Z 2025-05-07T20:33:15.4113298Z self = 2025-05-07T20:33:15.4114753Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.4116332Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd90b7eef0>} 2025-05-07T20:33:15.4117871Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.4119028Z context = 2025-05-07T20:33:15.4119358Z 2025-05-07T20:33:15.4119550Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.4120143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.4120733Z module_map=module_map) 2025-05-07T20:33:15.4121147Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.4121558Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:15.4121866Z E ^ 2025-05-07T20:33:15.4122393Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.4122960Z 2025-05-07T20:33:15.4123476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
[Eight further Hypothesis examples — test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True), (T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False), (T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True), (T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True), (T=1, D=5120, scale_ub=None, contiguous=False, compiled=False), (T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False), (T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True), and (T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) — reprint the same test body and fail in _fbgemm_silu_mul_quant with the identical CompilationError; the duplicate tracebacks are elided.]
Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False, ) [same test body and _fbgemm_silu_mul_quant traceback as above, elided] E triton.compiler.errors.CompilationError: at 1:0: E def _fbgemm_silu_mul_quant( E ^ E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.4264814Z 2025-05-07T20:33:16.4265274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.4265839Z 2025-05-07T20:33:16.4265954Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.4266412Z self=, 2025-05-07T20:33:16.4266849Z T=4096, 2025-05-07T20:33:16.4267059Z D=5120, 2025-05-07T20:33:16.4267276Z scale_ub=1200.0, 2025-05-07T20:33:16.4267526Z contiguous=False, 2025-05-07T20:33:16.4267777Z compiled=True, 2025-05-07T20:33:16.4268002Z ) 2025-05-07T20:33:16.4268352Z self = 2025-05-07T20:33:16.4268944Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:16.4269251Z 2025-05-07T20:33:16.4269339Z @given( 2025-05-07T20:33:16.4269595Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.4269943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.4270281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.4270648Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.4271007Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.4271324Z ) 2025-05-07T20:33:16.4271712Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.4272193Z def test_silu_mul_quant( 2025-05-07T20:33:16.4272464Z self, 2025-05-07T20:33:16.4272738Z T: int, 2025-05-07T20:33:16.4272974Z D: int, 2025-05-07T20:33:16.4273216Z scale_ub: Optional[float], 2025-05-07T20:33:16.4273581Z contiguous: bool, 2025-05-07T20:33:16.4273854Z compiled: bool, 2025-05-07T20:33:16.4274097Z ) -> None: 2025-05-07T20:33:16.4274336Z torch.manual_seed(2025) 2025-05-07T20:33:16.4274601Z 2025-05-07T20:33:16.4274896Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.4275364Z 2025-05-07T20:33:16.4275580Z x_sign = torch.sign(x) 2025-05-07T20:33:16.4275895Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.4276238Z x = x_sign * x_clamp 2025-05-07T20:33:16.4276507Z x0 = x[:, :D] 2025-05-07T20:33:16.4276742Z x1 = x[:, D:] 2025-05-07T20:33:16.4276975Z 2025-05-07T20:33:16.4277183Z if contiguous: 2025-05-07T20:33:16.4277435Z x0 = x0.contiguous() 2025-05-07T20:33:16.4277720Z x1 = x1.contiguous() 2025-05-07T20:33:16.4277989Z 2025-05-07T20:33:16.4278197Z if scale_ub is not None: 2025-05-07T20:33:16.4278500Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.4278868Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.4279206Z ) 2025-05-07T20:33:16.4279414Z else: 2025-05-07T20:33:16.4279647Z scale_ub_tensor = None 2025-05-07T20:33:16.4279926Z 2025-05-07T20:33:16.4280178Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.4280523Z op = silu_mul_quant 2025-05-07T20:33:16.4280800Z if compiled: 2025-05-07T20:33:16.4281070Z op = torch.compile(op) 2025-05-07T20:33:16.4281399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.4281702Z 2025-05-07T20:33:16.4281909Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.4282097Z 2025-05-07T20:33:16.4282206Z moe/activation_test.py:117: 2025-05-07T20:33:16.4282563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.4282952Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.4283291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.4283910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:16.4284525Z return fn(*args, **kwargs) 
2025-05-07T20:33:16.4285253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.4286011Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.4286607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.4287359Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.4288083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.4288673Z kernel = self.compile( 2025-05-07T20:33:16.4289273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.4290042Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.4290484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.4290744Z 2025-05-07T20:33:16.4290975Z self = 2025-05-07T20:33:16.4292165Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.4293670Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9074c700>} 2025-05-07T20:33:16.4295220Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.4296351Z context = 2025-05-07T20:33:16.4296668Z 2025-05-07T20:33:16.4296857Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.4297570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.4298087Z module_map=module_map) 2025-05-07T20:33:16.4298491Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.4298882Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.4299165Z E ^ 2025-05-07T20:33:16.4299678Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.4300179Z 2025-05-07T20:33:16.4300639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.4301201Z 2025-05-07T20:33:16.5677149Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.5677658Z self=, 2025-05-07T20:33:16.5678118Z T=2048, 2025-05-07T20:33:16.5678337Z D=7168, 2025-05-07T20:33:16.5678553Z scale_ub=1200.0, 2025-05-07T20:33:16.5678803Z contiguous=False, 2025-05-07T20:33:16.5679061Z compiled=False, 2025-05-07T20:33:16.5679290Z ) 2025-05-07T20:33:16.5679645Z self = 2025-05-07T20:33:16.5680200Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:16.5680508Z 2025-05-07T20:33:16.5680600Z @given( 2025-05-07T20:33:16.5680855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.5681207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.5681555Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.5681922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.5682292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.5682614Z ) 2025-05-07T20:33:16.5683006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.5683497Z def test_silu_mul_quant( 2025-05-07T20:33:16.5683771Z self, 2025-05-07T20:33:16.5683993Z T: int, 2025-05-07T20:33:16.5684224Z D: int, 2025-05-07T20:33:16.5684468Z scale_ub: Optional[float], 2025-05-07T20:33:16.5684779Z contiguous: bool, 2025-05-07T20:33:16.5685055Z compiled: bool, 2025-05-07T20:33:16.5685301Z ) -> None: 2025-05-07T20:33:16.5685539Z torch.manual_seed(2025) 2025-05-07T20:33:16.5685813Z 2025-05-07T20:33:16.5686113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.5686494Z 2025-05-07T20:33:16.5686716Z x_sign = torch.sign(x) 2025-05-07T20:33:16.5687034Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.5687586Z x = x_sign * x_clamp 2025-05-07T20:33:16.5694508Z x0 = x[:, :D] 2025-05-07T20:33:16.5694794Z x1 = x[:, D:] 2025-05-07T20:33:16.5695038Z 2025-05-07T20:33:16.5695252Z if contiguous: 2025-05-07T20:33:16.5695523Z x0 = x0.contiguous() 2025-05-07T20:33:16.5695825Z x1 = x1.contiguous() 2025-05-07T20:33:16.5696103Z 2025-05-07T20:33:16.5696321Z if scale_ub is not None: 2025-05-07T20:33:16.5696637Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.5697017Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.5697362Z ) 2025-05-07T20:33:16.5697586Z else: 2025-05-07T20:33:16.5697830Z scale_ub_tensor = None 2025-05-07T20:33:16.5698117Z 2025-05-07T20:33:16.5698489Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.5698845Z op = silu_mul_quant 2025-05-07T20:33:16.5699133Z if compiled: 2025-05-07T20:33:16.5699451Z op = torch.compile(op) 2025-05-07T20:33:16.5699785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.5700096Z 2025-05-07T20:33:16.5700316Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.5700580Z 2025-05-07T20:33:16.5700701Z moe/activation_test.py:117: 2025-05-07T20:33:16.5701096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.5701475Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.5701789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.5702563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:16.5703334Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.5703937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.5704849Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.5705598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.5706197Z kernel = self.compile( 2025-05-07T20:33:16.5706808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.5707538Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.5707979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.5708233Z 2025-05-07T20:33:16.5708469Z self = 2025-05-07T20:33:16.5709661Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.5711196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9074d240>} 2025-05-07T20:33:16.5712695Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.5713909Z context = 2025-05-07T20:33:16.5714231Z 2025-05-07T20:33:16.5714422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.5714999Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.5715523Z module_map=module_map) 2025-05-07T20:33:16.5715933Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.5716319Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.5716616Z E ^ 2025-05-07T20:33:16.5717201Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.5717703Z 2025-05-07T20:33:16.5718174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.5718743Z 2025-05-07T20:33:16.5718858Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.5719318Z self=, 2025-05-07T20:33:16.5719766Z T=1, 2025-05-07T20:33:16.5719972Z D=7168, 2025-05-07T20:33:16.5720189Z scale_ub=None, 2025-05-07T20:33:16.5720428Z contiguous=True, 2025-05-07T20:33:16.5720675Z compiled=False, 2025-05-07T20:33:16.5720905Z ) 2025-05-07T20:33:16.5721312Z self = 2025-05-07T20:33:16.5721853Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:16.5722143Z 2025-05-07T20:33:16.5722234Z @given( 2025-05-07T20:33:16.5722496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.5722896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.5723235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.5723697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.5724402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.5724723Z ) 2025-05-07T20:33:16.5725115Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.5725606Z def test_silu_mul_quant( 2025-05-07T20:33:16.5725879Z self, 2025-05-07T20:33:16.5726093Z T: int, 2025-05-07T20:33:16.5726316Z D: int, 2025-05-07T20:33:16.5726563Z scale_ub: Optional[float], 2025-05-07T20:33:16.5726865Z contiguous: bool, 2025-05-07T20:33:16.5727133Z compiled: bool, 2025-05-07T20:33:16.5727384Z ) -> None: 2025-05-07T20:33:16.5727625Z torch.manual_seed(2025) 2025-05-07T20:33:16.5727895Z 2025-05-07T20:33:16.5728200Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.5728577Z 2025-05-07T20:33:16.5728804Z x_sign = torch.sign(x) 2025-05-07T20:33:16.5729139Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.5729485Z x = x_sign * x_clamp 2025-05-07T20:33:16.5729756Z x0 = x[:, :D] 2025-05-07T20:33:16.5730002Z x1 = x[:, D:] 2025-05-07T20:33:16.5730233Z 2025-05-07T20:33:16.5730444Z if contiguous: 2025-05-07T20:33:16.5730707Z x0 = x0.contiguous() 2025-05-07T20:33:16.5730992Z x1 = x1.contiguous() 2025-05-07T20:33:16.5731265Z 2025-05-07T20:33:16.5731480Z if scale_ub is not None: 2025-05-07T20:33:16.5731791Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.5732158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.5732507Z ) 2025-05-07T20:33:16.5732730Z else: 2025-05-07T20:33:16.5732965Z scale_ub_tensor = None 2025-05-07T20:33:16.5733247Z 2025-05-07T20:33:16.5733509Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.5733858Z op = silu_mul_quant 2025-05-07T20:33:16.5734145Z if compiled: 2025-05-07T20:33:16.5734424Z op = torch.compile(op) 2025-05-07T20:33:16.5734753Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.5735060Z 2025-05-07T20:33:16.5735281Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.5735465Z 2025-05-07T20:33:16.5735577Z moe/activation_test.py:117: 2025-05-07T20:33:16.5735908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.5736278Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.5736596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.5737446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.5738220Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.5738825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.5739587Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.5740327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.5740925Z kernel = self.compile( 2025-05-07T20:33:16.5741529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.5742252Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.5742805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.5743065Z 2025-05-07T20:33:16.5743299Z self = 2025-05-07T20:33:16.5744505Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.5746153Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9074e050>} 2025-05-07T20:33:16.5747644Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.5748774Z context = 2025-05-07T20:33:16.5749097Z 2025-05-07T20:33:16.5749289Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.5749864Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.5750386Z module_map=module_map) 2025-05-07T20:33:16.5750794Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.5751195Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.5751486Z E ^ 2025-05-07T20:33:16.5752005Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.5752507Z 2025-05-07T20:33:16.5752975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.5753616Z 2025-05-07T20:33:16.5753742Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.5754197Z self=, 2025-05-07T20:33:16.5754649Z T=16384, 2025-05-07T20:33:16.5754869Z D=7168, 2025-05-07T20:33:16.5755089Z scale_ub=1200.0, 2025-05-07T20:33:16.5755343Z contiguous=False, 2025-05-07T20:33:16.5755596Z compiled=True, 2025-05-07T20:33:16.8496288Z ) 2025-05-07T20:33:16.8496786Z self = 2025-05-07T20:33:16.8497358Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:16.8497686Z 2025-05-07T20:33:16.8497802Z @given( 2025-05-07T20:33:16.8498160Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.8498630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.8499042Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.8499399Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.8499743Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.8500049Z ) 2025-05-07T20:33:16.8500459Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.8500931Z def test_silu_mul_quant( 2025-05-07T20:33:16.8501193Z self, 2025-05-07T20:33:16.8501545Z T: int, 2025-05-07T20:33:16.8501768Z D: int, 2025-05-07T20:33:16.8502005Z scale_ub: Optional[float], 2025-05-07T20:33:16.8502289Z contiguous: bool, 2025-05-07T20:33:16.8502549Z compiled: bool, 2025-05-07T20:33:16.8502794Z ) -> None: 2025-05-07T20:33:16.8503022Z torch.manual_seed(2025) 2025-05-07T20:33:16.8503285Z 2025-05-07T20:33:16.8503579Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.8503947Z 2025-05-07T20:33:16.8504151Z x_sign = torch.sign(x) 2025-05-07T20:33:16.8504463Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.8504794Z x = x_sign * x_clamp 2025-05-07T20:33:16.8505046Z x0 = x[:, :D] 2025-05-07T20:33:16.8505357Z x1 = x[:, D:] 2025-05-07T20:33:16.8505584Z 2025-05-07T20:33:16.8505778Z if contiguous: 2025-05-07T20:33:16.8506027Z x0 = x0.contiguous() 2025-05-07T20:33:16.8506309Z x1 = x1.contiguous() 2025-05-07T20:33:16.8506561Z 2025-05-07T20:33:16.8506782Z if scale_ub is not None: 2025-05-07T20:33:16.8507071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.8507514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.8507949Z ) 2025-05-07T20:33:16.8508156Z else: 2025-05-07T20:33:16.8508389Z scale_ub_tensor = None 2025-05-07T20:33:16.8508661Z 2025-05-07T20:33:16.8508904Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.8509239Z op = silu_mul_quant 2025-05-07T20:33:16.8509507Z if compiled: 2025-05-07T20:33:16.8509768Z op = torch.compile(op) 2025-05-07T20:33:16.8510088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8510391Z 2025-05-07T20:33:16.8510597Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.8510779Z 2025-05-07T20:33:16.8510884Z moe/activation_test.py:117: 2025-05-07T20:33:16.8511213Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8511564Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.8511863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8512468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:16.8513113Z return fn(*args, **kwargs) 
2025-05-07T20:33:16.8513942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.8514678Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.8515248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.8515984Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.8516683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.8517255Z kernel = self.compile( 2025-05-07T20:33:16.8517833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.8518530Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.8518956Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8519203Z 2025-05-07T20:33:16.8519427Z self = 2025-05-07T20:33:16.8520570Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.8522094Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9074f490>} 2025-05-07T20:33:16.8523512Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.8524932Z context = 2025-05-07T20:33:16.8525246Z 2025-05-07T20:33:16.8525422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.8525975Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.8526467Z module_map=module_map) 2025-05-07T20:33:16.8526857Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.8527232Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.8527591Z E ^ 2025-05-07T20:33:16.8528090Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.8528570Z 2025-05-07T20:33:16.8529007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.8529545Z 2025-05-07T20:33:16.8529728Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.8530213Z self=, 2025-05-07T20:33:16.8530644Z T=1, 2025-05-07T20:33:16.8530847Z D=7168, 2025-05-07T20:33:16.8531049Z scale_ub=None, 2025-05-07T20:33:16.8531283Z contiguous=False, 2025-05-07T20:33:16.8531530Z compiled=False, 2025-05-07T20:33:16.8531747Z ) 2025-05-07T20:33:16.8532088Z self = 2025-05-07T20:33:16.8532612Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:16.8532892Z 2025-05-07T20:33:16.8532984Z @given( 2025-05-07T20:33:16.8533227Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.8533565Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.8533894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.8534242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.8534601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.8534911Z ) 2025-05-07T20:33:16.8535281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.8535752Z def test_silu_mul_quant( 2025-05-07T20:33:16.8536014Z self, 2025-05-07T20:33:16.8536233Z T: int, 2025-05-07T20:33:16.8536443Z D: int, 2025-05-07T20:33:16.8536678Z scale_ub: Optional[float], 2025-05-07T20:33:16.8536966Z contiguous: bool, 2025-05-07T20:33:16.8537218Z compiled: bool, 2025-05-07T20:33:16.8537460Z ) -> None: 2025-05-07T20:33:16.8537688Z torch.manual_seed(2025) 2025-05-07T20:33:16.8537941Z 2025-05-07T20:33:16.8538238Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.8538601Z 2025-05-07T20:33:16.8538805Z x_sign = torch.sign(x) 2025-05-07T20:33:16.8539115Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.8539450Z x = x_sign * x_clamp 2025-05-07T20:33:16.8539702Z x0 = x[:, :D] 2025-05-07T20:33:16.8539943Z x1 = x[:, D:] 2025-05-07T20:33:16.8540169Z 2025-05-07T20:33:16.8540363Z if contiguous: 2025-05-07T20:33:16.8540615Z x0 = x0.contiguous() 2025-05-07T20:33:16.8540890Z x1 = x1.contiguous() 2025-05-07T20:33:16.8541144Z 2025-05-07T20:33:16.8541352Z if scale_ub is not None: 2025-05-07T20:33:16.8541643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.8542000Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.8542331Z ) 2025-05-07T20:33:16.8542540Z else: 2025-05-07T20:33:16.8542763Z scale_ub_tensor = None 2025-05-07T20:33:16.8543026Z 2025-05-07T20:33:16.8543345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.8543683Z op = silu_mul_quant 2025-05-07T20:33:16.8543946Z if compiled: 2025-05-07T20:33:16.8544213Z op = torch.compile(op) 2025-05-07T20:33:16.8544535Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8544822Z 2025-05-07T20:33:16.8545031Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.8545206Z 2025-05-07T20:33:16.8545320Z moe/activation_test.py:117: 2025-05-07T20:33:16.8545633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8545987Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.8546287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8547025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.8547945Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.8548722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.8549447Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.8550263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.8550828Z kernel = self.compile( 2025-05-07T20:33:16.8551411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.8552109Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.8552528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8552787Z 2025-05-07T20:33:16.8553010Z self = 2025-05-07T20:33:16.8554227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.8555695Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9074f7f0>} 2025-05-07T20:33:16.8557138Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.8558212Z context = 2025-05-07T20:33:16.8558639Z 2025-05-07T20:33:16.8558898Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.8559562Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.8560067Z module_map=module_map) 2025-05-07T20:33:16.8560449Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.8560820Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.8561098Z E ^ 2025-05-07T20:33:16.8561586Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.8562068Z 2025-05-07T20:33:16.8562504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.8563047Z 2025-05-07T20:33:16.8563157Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.8563598Z self=, 2025-05-07T20:33:16.8564018Z T=2048, 2025-05-07T20:33:16.8564222Z D=7168, 2025-05-07T20:33:16.8564432Z scale_ub=None, 2025-05-07T20:33:16.8564658Z contiguous=False, 2025-05-07T20:33:16.8564898Z compiled=True, 2025-05-07T20:33:16.8565117Z ) 2025-05-07T20:33:16.9567399Z self = 2025-05-07T20:33:16.9568037Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:16.9568328Z 2025-05-07T20:33:16.9568421Z @given( 2025-05-07T20:33:16.9568675Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.9569011Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.9569337Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.9569694Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.9570049Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.9570355Z ) 2025-05-07T20:33:16.9570725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.9571273Z def test_silu_mul_quant( 2025-05-07T20:33:16.9571536Z self, 2025-05-07T20:33:16.9571743Z T: int, 2025-05-07T20:33:16.9571959Z D: int, 2025-05-07T20:33:16.9572202Z scale_ub: Optional[float], 2025-05-07T20:33:16.9572488Z contiguous: bool, 2025-05-07T20:33:16.9572750Z compiled: bool, 2025-05-07T20:33:16.9572997Z ) -> None: 2025-05-07T20:33:16.9573226Z torch.manual_seed(2025) 2025-05-07T20:33:16.9573561Z 2025-05-07T20:33:16.9573923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.9574287Z 2025-05-07T20:33:16.9574503Z x_sign = torch.sign(x) 2025-05-07T20:33:16.9574822Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.9575149Z x = x_sign * x_clamp 2025-05-07T20:33:16.9575412Z x0 = x[:, :D] 2025-05-07T20:33:16.9575660Z x1 = x[:, D:] 2025-05-07T20:33:16.9575885Z 2025-05-07T20:33:16.9576086Z if contiguous: 2025-05-07T20:33:16.9576346Z x0 = x0.contiguous() 2025-05-07T20:33:16.9576623Z x1 = x1.contiguous() 2025-05-07T20:33:16.9576878Z 2025-05-07T20:33:16.9577091Z if scale_ub is not None: 2025-05-07T20:33:16.9577388Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.9577900Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.9578388Z ) 2025-05-07T20:33:16.9578681Z else: 2025-05-07T20:33:16.9578995Z scale_ub_tensor = None 2025-05-07T20:33:16.9579353Z 2025-05-07T20:33:16.9579606Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.9579933Z op = silu_mul_quant 2025-05-07T20:33:16.9580199Z if compiled: 2025-05-07T20:33:16.9580466Z op = torch.compile(op) 2025-05-07T20:33:16.9580779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9581074Z 2025-05-07T20:33:16.9581284Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.9581460Z 2025-05-07T20:33:16.9581575Z moe/activation_test.py:117: 2025-05-07T20:33:16.9581884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9582237Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.9582543Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9583183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:16.9583781Z return fn(*args, **kwargs) 
2025-05-07T20:33:16.9584481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.9585209Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.9585770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.9586491Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.9587192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.9587751Z kernel = self.compile( 2025-05-07T20:33:16.9588401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.9589103Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.9589531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9589776Z 2025-05-07T20:33:16.9590000Z self = 2025-05-07T20:33:16.9591141Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.9592598Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbf1caf0>} 2025-05-07T20:33:16.9594167Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.9595233Z context = 2025-05-07T20:33:16.9595616Z 2025-05-07T20:33:16.9595833Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.9596390Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.9596886Z module_map=module_map) 2025-05-07T20:33:16.9604321Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.9604715Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.9604998Z E ^ 2025-05-07T20:33:16.9605490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.9605974Z 2025-05-07T20:33:16.9606415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.9606958Z 2025-05-07T20:33:16.9607069Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.9607508Z self=, 2025-05-07T20:33:16.9607925Z T=4096, 2025-05-07T20:33:16.9608129Z D=7168, 2025-05-07T20:33:16.9608339Z scale_ub=None, 2025-05-07T20:33:16.9608564Z contiguous=False, 2025-05-07T20:33:16.9608812Z compiled=True, 2025-05-07T20:33:16.9609029Z ) 2025-05-07T20:33:16.9609361Z self = 2025-05-07T20:33:16.9609880Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:16.9610169Z 2025-05-07T20:33:16.9610252Z @given( 2025-05-07T20:33:16.9610502Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.9610830Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.9611159Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.9611510Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.9611855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.9612161Z ) 2025-05-07T20:33:16.9612536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.9613000Z def test_silu_mul_quant( 2025-05-07T20:33:16.9613259Z self, 2025-05-07T20:33:16.9613469Z T: int, 2025-05-07T20:33:16.9613675Z D: int, 2025-05-07T20:33:16.9613908Z scale_ub: Optional[float], 2025-05-07T20:33:16.9614199Z contiguous: bool, 2025-05-07T20:33:16.9614457Z compiled: bool, 2025-05-07T20:33:16.9614692Z ) -> None: 2025-05-07T20:33:16.9614923Z torch.manual_seed(2025) 2025-05-07T20:33:16.9615180Z 2025-05-07T20:33:16.9615467Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.9615831Z 2025-05-07T20:33:16.9616043Z x_sign = torch.sign(x) 2025-05-07T20:33:16.9616428Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.9616760Z x = x_sign * x_clamp 2025-05-07T20:33:16.9617022Z x0 = x[:, :D] 2025-05-07T20:33:16.9617251Z x1 = x[:, D:] 2025-05-07T20:33:16.9617481Z 2025-05-07T20:33:16.9617685Z if contiguous: 2025-05-07T20:33:16.9617932Z x0 = x0.contiguous() 2025-05-07T20:33:16.9618210Z x1 = x1.contiguous() 2025-05-07T20:33:16.9618468Z 2025-05-07T20:33:16.9618668Z if scale_ub is not None: 2025-05-07T20:33:16.9618959Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.9619316Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.9619638Z ) 2025-05-07T20:33:16.9619846Z else: 2025-05-07T20:33:16.9620120Z scale_ub_tensor = None 2025-05-07T20:33:16.9620388Z 2025-05-07T20:33:16.9620630Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.9620963Z op = silu_mul_quant 2025-05-07T20:33:16.9621231Z if compiled: 2025-05-07T20:33:16.9621493Z op = torch.compile(op) 2025-05-07T20:33:16.9621811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9622147Z 2025-05-07T20:33:16.9622349Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.9622569Z 2025-05-07T20:33:16.9622675Z moe/activation_test.py:117: 2025-05-07T20:33:16.9622992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9623335Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.9623634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9624595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:16.9625183Z return fn(*args, **kwargs) 
2025-05-07T20:33:16.9625873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.9626602Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.9627169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.9627876Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.9628574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.9629131Z kernel = self.compile( 2025-05-07T20:33:16.9629699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.9630380Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.9630798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9631040Z 2025-05-07T20:33:16.9631266Z self = 2025-05-07T20:33:16.9632397Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.9633958Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbf1c280>} 2025-05-07T20:33:16.9635367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.9636443Z context = 2025-05-07T20:33:16.9636748Z 2025-05-07T20:33:16.9636930Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.9637473Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.9638072Z module_map=module_map) 2025-05-07T20:33:16.9638462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.9638837Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.9639111Z E ^ 2025-05-07T20:33:16.9639605Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.9640073Z 2025-05-07T20:33:16.9640516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.9641051Z 2025-05-07T20:33:17.3089239Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:17.3089790Z self=, 2025-05-07T20:33:17.3090561Z T=16384, 2025-05-07T20:33:17.3090849Z D=5120, 2025-05-07T20:33:17.3091145Z scale_ub=1200.0, 2025-05-07T20:33:17.3091439Z contiguous=False, 2025-05-07T20:33:17.3091683Z compiled=False, 2025-05-07T20:33:17.3091910Z ) 2025-05-07T20:33:17.3092253Z self = 2025-05-07T20:33:17.3092787Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:17.3093199Z 2025-05-07T20:33:17.3093318Z @given( 2025-05-07T20:33:17.3093621Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.3093958Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.3094291Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.3094638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.3094992Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.3095302Z ) 2025-05-07T20:33:17.3095671Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.3096148Z def test_silu_mul_quant( 2025-05-07T20:33:17.3096408Z self, 2025-05-07T20:33:17.3096619Z T: int, 2025-05-07T20:33:17.3096831Z D: int, 2025-05-07T20:33:17.3097069Z scale_ub: Optional[float], 2025-05-07T20:33:17.3097363Z contiguous: bool, 2025-05-07T20:33:17.3097617Z compiled: bool, 2025-05-07T20:33:17.3097867Z ) -> None: 2025-05-07T20:33:17.3098103Z torch.manual_seed(2025) 2025-05-07T20:33:17.3098365Z 2025-05-07T20:33:17.3098659Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.3099026Z 2025-05-07T20:33:17.3099232Z x_sign = torch.sign(x) 2025-05-07T20:33:17.3099552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.3099885Z x = x_sign * x_clamp 2025-05-07T20:33:17.3100141Z x0 = x[:, :D] 2025-05-07T20:33:17.3100377Z x1 = x[:, D:] 2025-05-07T20:33:17.3100603Z 2025-05-07T20:33:17.3100802Z if contiguous: 2025-05-07T20:33:17.3101054Z x0 = x0.contiguous() 2025-05-07T20:33:17.3101324Z x1 = x1.contiguous() 2025-05-07T20:33:17.3101576Z 2025-05-07T20:33:17.3101785Z if scale_ub is not None: 2025-05-07T20:33:17.3102076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.3102426Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.3102776Z ) 2025-05-07T20:33:17.3103014Z else: 2025-05-07T20:33:17.3103268Z scale_ub_tensor = None 2025-05-07T20:33:17.3103535Z 2025-05-07T20:33:17.3103784Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.3104119Z op = silu_mul_quant 2025-05-07T20:33:17.3104384Z if compiled: 2025-05-07T20:33:17.3104657Z op = torch.compile(op) 2025-05-07T20:33:17.3104976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3105264Z 2025-05-07T20:33:17.3105474Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.3105651Z 2025-05-07T20:33:17.3105764Z moe/activation_test.py:117: 2025-05-07T20:33:17.3106210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3106566Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.3106876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3107613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:17.3108342Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.3108913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.3109639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.3110338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.3110907Z kernel = self.compile( 2025-05-07T20:33:17.3111530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.3112225Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.3112642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3112889Z 2025-05-07T20:33:17.3113108Z self = 2025-05-07T20:33:17.3114427Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.3115897Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbf1ed40>} 2025-05-07T20:33:17.3117323Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.3118410Z context = 2025-05-07T20:33:17.3118722Z 2025-05-07T20:33:17.3118899Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.3119456Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.3119948Z module_map=module_map) 2025-05-07T20:33:17.3120339Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.3120717Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.3120993Z E ^ 2025-05-07T20:33:17.3121483Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:17.3121961Z 2025-05-07T20:33:17.3122403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:17.3122946Z 2025-05-07T20:33:17.3123071Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:17.3123511Z self=, 2025-05-07T20:33:17.3124179Z T=16384, 2025-05-07T20:33:17.3124391Z D=5120, 2025-05-07T20:33:17.3124610Z scale_ub=1200.0, 2025-05-07T20:33:17.3124844Z contiguous=True, 2025-05-07T20:33:17.3125089Z compiled=True, 2025-05-07T20:33:17.3125312Z ) 2025-05-07T20:33:17.3125647Z self = 2025-05-07T20:33:17.3126172Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:17.3126469Z 2025-05-07T20:33:17.3126554Z @given( 2025-05-07T20:33:17.3126801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.3127132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.3127462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.3127816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.3128239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.3128545Z ) 2025-05-07T20:33:17.3128919Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.3129384Z def test_silu_mul_quant( 2025-05-07T20:33:17.3129647Z self, 2025-05-07T20:33:17.3129859Z T: int, 2025-05-07T20:33:17.3130071Z D: int, 2025-05-07T20:33:17.3130298Z scale_ub: Optional[float], 2025-05-07T20:33:17.3130586Z contiguous: bool, 2025-05-07T20:33:17.3130840Z compiled: bool, 2025-05-07T20:33:17.3131073Z ) -> None: 2025-05-07T20:33:17.3131304Z torch.manual_seed(2025) 2025-05-07T20:33:17.3131561Z 2025-05-07T20:33:17.3131846Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.3132274Z 2025-05-07T20:33:17.3132482Z x_sign = torch.sign(x) 2025-05-07T20:33:17.3132785Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.3133113Z x = x_sign * x_clamp 2025-05-07T20:33:17.3133374Z x0 = x[:, :D] 2025-05-07T20:33:17.3133604Z x1 = x[:, D:] 2025-05-07T20:33:17.3133826Z 2025-05-07T20:33:17.3134027Z if contiguous: 2025-05-07T20:33:17.3134268Z x0 = x0.contiguous() 2025-05-07T20:33:17.3134613Z x1 = x1.contiguous() 2025-05-07T20:33:17.3134927Z 2025-05-07T20:33:17.3135133Z if scale_ub is not None: 2025-05-07T20:33:17.3135424Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.3135782Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.3136116Z ) 2025-05-07T20:33:17.3136327Z else: 2025-05-07T20:33:17.3136554Z scale_ub_tensor = None 2025-05-07T20:33:17.3136823Z 2025-05-07T20:33:17.3137066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.3137403Z op = silu_mul_quant 2025-05-07T20:33:17.3137667Z if compiled: 2025-05-07T20:33:17.3137933Z op = torch.compile(op) 2025-05-07T20:33:17.3138251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3138545Z 2025-05-07T20:33:17.3138747Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.3138929Z 2025-05-07T20:33:17.3139037Z moe/activation_test.py:117: 2025-05-07T20:33:17.3139355Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3139699Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.3139999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3140592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:17.3141183Z return fn(*args, **kwargs) 
2025-05-07T20:33:17.3141880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:17.3142612Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.3143186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.3143904Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.3144598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.3145160Z kernel = self.compile( 2025-05-07T20:33:17.3145729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.3146411Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.3146827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3147072Z 2025-05-07T20:33:17.3147289Z self = 2025-05-07T20:33:17.3148474Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.3149906Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbf1e830>} 2025-05-07T20:33:17.3151313Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.3152382Z context = 2025-05-07T20:33:17.3152706Z 2025-05-07T20:33:17.3152917Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.3153467Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.3154074Z module_map=module_map) 2025-05-07T20:33:17.3154462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.3154834Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.3155105Z E ^ 2025-05-07T20:33:17.3155597Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:17.3156142Z 2025-05-07T20:33:17.3156619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:17.3157263Z 2025-05-07T20:33:17.5057240Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:17.5058481Z self=, 2025-05-07T20:33:17.5059566Z T=16384, 2025-05-07T20:33:17.5059959Z D=5120, 2025-05-07T20:33:17.5060337Z scale_ub=None, 2025-05-07T20:33:17.5060767Z contiguous=False, 2025-05-07T20:33:17.5061206Z compiled=True, 2025-05-07T20:33:17.5061607Z ) 2025-05-07T20:33:17.5062222Z self = 2025-05-07T20:33:17.5063174Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:17.5063523Z 2025-05-07T20:33:17.5063610Z @given( 2025-05-07T20:33:17.5063862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.5064201Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.5064531Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.5064887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.5065233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.5065542Z ) 2025-05-07T20:33:17.5065920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.5066386Z def test_silu_mul_quant( 2025-05-07T20:33:17.5066649Z self, 2025-05-07T20:33:17.5066863Z T: int, 2025-05-07T20:33:17.5067076Z D: int, 2025-05-07T20:33:17.5067315Z scale_ub: Optional[float], 2025-05-07T20:33:17.5067610Z contiguous: bool, 2025-05-07T20:33:17.5067866Z compiled: bool, 2025-05-07T20:33:17.5068107Z ) -> None: 2025-05-07T20:33:17.5068344Z torch.manual_seed(2025) 2025-05-07T20:33:17.5068611Z 2025-05-07T20:33:17.5068897Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.5069266Z 2025-05-07T20:33:17.5069476Z x_sign = torch.sign(x) 2025-05-07T20:33:17.5069781Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.5070110Z x = x_sign * x_clamp 2025-05-07T20:33:17.5070368Z x0 = x[:, :D] 2025-05-07T20:33:17.5070593Z x1 = x[:, D:] 2025-05-07T20:33:17.5070816Z 2025-05-07T20:33:17.5071016Z if contiguous: 2025-05-07T20:33:17.5071260Z x0 = x0.contiguous() 2025-05-07T20:33:17.5071539Z x1 = x1.contiguous() 2025-05-07T20:33:17.5071795Z 2025-05-07T20:33:17.5071998Z if scale_ub is not None: 2025-05-07T20:33:17.5072432Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.5072798Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.5073120Z ) 2025-05-07T20:33:17.5073331Z else: 2025-05-07T20:33:17.5073642Z scale_ub_tensor = None 2025-05-07T20:33:17.5073917Z 2025-05-07T20:33:17.5074165Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.5074502Z op = silu_mul_quant 2025-05-07T20:33:17.5074770Z if compiled: 2025-05-07T20:33:17.5075029Z op = torch.compile(op) 2025-05-07T20:33:17.5075343Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.5075637Z 2025-05-07T20:33:17.5075839Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.5076024Z 2025-05-07T20:33:17.5076131Z moe/activation_test.py:117: 2025-05-07T20:33:17.5076524Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.5076870Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.5077178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.5077771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:17.5078369Z return fn(*args, **kwargs) 
2025-05-07T20:33:17.5079207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:17.5079940Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.5080510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.5081233Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.5081936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.5082508Z kernel = self.compile( 2025-05-07T20:33:17.5083117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.5083800Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.5084220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.5084462Z 2025-05-07T20:33:17.5084690Z self = 2025-05-07T20:33:17.5085825Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.5087276Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbf1f760>} 2025-05-07T20:33:17.5088692Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.5089766Z context = 2025-05-07T20:33:17.5090070Z 2025-05-07T20:33:17.5090254Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.5090800Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.5091290Z module_map=module_map) 2025-05-07T20:33:17.5091676Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.5092046Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.5092315Z E ^ 2025-05-07T20:33:17.5092809Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:17.5093281Z 2025-05-07T20:33:17.5093720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
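Every one of these compilation failures has the same root cause: the kernel asks Triton for the fp8e4nv (FP8 E4M3) dtype, which Triton only provides on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper), while the A10G on this linux.g5.4xlarge runner is SM 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the error message lists. A minimal sketch of a capability guard that would skip these examples on unsupported hardware; the helper and marker names are hypothetical, not part of the test suite:

    # Minimal sketch, assuming pytest: skip fp8 (E4M3) Triton tests on GPUs older
    # than SM 8.9, where Triton raises "type fp8e4nv not supported in this architecture".
    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton's fp8e4nv lowering needs compute capability >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8 = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="fp8e4nv requires an SM 8.9+ GPU; this device only supports fp8e4b15/fp8e5",
    )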
[The next ten Hypothesis examples fail identically; the repeated test source and traceback are elided, keeping one line per example.]
2025-05-07T20:33:17.5094421Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:17.6177762Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:17.9870410Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:17.9905845Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:18.1095882Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:18.2499733Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:18.2535641Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:18.4513006Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:18.4548232Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:18.5632786Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[each fails with: E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:33:18.6466278Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
[same test source; this example fails earlier, while preparing its inputs]
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
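The allocation size in this OOM message lines up exactly with the test's own shapes, which points at capacity rather than a miscomputed launch: torch.abs(x) materializes a temporary the same size as x, a [T, 2*D] bfloat16 tensor. A quick back-of-the-envelope check (plain arithmetic, not taken from the log):

    # For the failing example above (T=16384, D=5120), one [T, 2*D] bfloat16
    # temporary is exactly the 320.00 MiB the allocator reports:
    T, D = 16384, 5120
    size_bytes = T * (2 * D) * 2      # bfloat16 is 2 bytes/element -> 335_544_320
    print(size_bytes / 2**20)         # 320.0 (MiB); T=4096, D=7168 gives 112.0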
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6480568Z 2025-05-07T20:33:18.6480698Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:18.6480922Z 2025-05-07T20:33:18.6481038Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6481471Z self=, 2025-05-07T20:33:18.6481895Z T=4096, 2025-05-07T20:33:18.6482096Z D=7168, 2025-05-07T20:33:18.6482294Z scale_ub=1200.0, 2025-05-07T20:33:18.6482615Z contiguous=True, 2025-05-07T20:33:18.6482852Z compiled=True, 2025-05-07T20:33:18.6483065Z ) 2025-05-07T20:33:18.6483448Z self = 2025-05-07T20:33:18.6483965Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:18.6484245Z 2025-05-07T20:33:18.6484328Z @given( 2025-05-07T20:33:18.6484570Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6484897Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6485220Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6485561Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6485908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6486272Z ) 2025-05-07T20:33:18.6486639Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6487101Z def test_silu_mul_quant( 2025-05-07T20:33:18.6487361Z self, 2025-05-07T20:33:18.6487562Z T: int, 2025-05-07T20:33:18.6487770Z D: int, 2025-05-07T20:33:18.6487998Z scale_ub: Optional[float], 2025-05-07T20:33:18.6488280Z contiguous: bool, 2025-05-07T20:33:18.6488635Z compiled: bool, 2025-05-07T20:33:18.6488909Z ) -> None: 2025-05-07T20:33:18.6489133Z torch.manual_seed(2025) 2025-05-07T20:33:18.6489389Z 2025-05-07T20:33:18.6489675Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6490034Z 2025-05-07T20:33:18.6490238Z x_sign = torch.sign(x) 2025-05-07T20:33:18.6490541Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.6492636Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6494583Z 2025-05-07T20:33:18.6494714Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:18.6494936Z 2025-05-07T20:33:18.6495045Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6495480Z self=, 2025-05-07T20:33:18.6495898Z T=16384, 2025-05-07T20:33:18.6496103Z D=7168, 2025-05-07T20:33:18.6496302Z scale_ub=None, 2025-05-07T20:33:18.6496530Z contiguous=False, 2025-05-07T20:33:18.6496766Z compiled=False, 2025-05-07T20:33:18.6496979Z ) 2025-05-07T20:33:18.6497310Z self = 2025-05-07T20:33:18.6497831Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:18.6498123Z 2025-05-07T20:33:18.6498205Z @given( 2025-05-07T20:33:18.6498446Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6498774Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6499095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6499440Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6499786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6500086Z ) 2025-05-07T20:33:18.6500448Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6500908Z def test_silu_mul_quant( 2025-05-07T20:33:18.6501162Z self, 2025-05-07T20:33:18.6501363Z T: int, 2025-05-07T20:33:18.6501569Z D: int, 2025-05-07T20:33:18.6501800Z scale_ub: Optional[float], 2025-05-07T20:33:18.6502081Z contiguous: bool, 2025-05-07T20:33:18.6502336Z compiled: bool, 2025-05-07T20:33:18.6502619Z ) -> None: 2025-05-07T20:33:18.6502844Z torch.manual_seed(2025) 2025-05-07T20:33:18.6503099Z 2025-05-07T20:33:18.6503381Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6505538Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6507525Z 2025-05-07T20:33:18.6507656Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.6507879Z 2025-05-07T20:33:18.6507990Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6508422Z self=, 2025-05-07T20:33:18.6508843Z T=2048, 2025-05-07T20:33:18.6509037Z D=7168, 2025-05-07T20:33:18.6509286Z scale_ub=1200.0, 2025-05-07T20:33:18.6509521Z contiguous=True, 2025-05-07T20:33:18.6509814Z compiled=True, 2025-05-07T20:33:18.6510028Z ) 2025-05-07T20:33:18.6510356Z self = 2025-05-07T20:33:18.6510865Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:18.6511154Z 2025-05-07T20:33:18.6511235Z @given( 2025-05-07T20:33:18.6511474Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6511800Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6512120Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6512463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6512813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6513110Z ) 2025-05-07T20:33:18.6513477Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6513982Z def test_silu_mul_quant( 2025-05-07T20:33:18.6514231Z self, 2025-05-07T20:33:18.6514444Z T: int, 2025-05-07T20:33:18.6514653Z D: int, 2025-05-07T20:33:18.6514877Z scale_ub: Optional[float], 2025-05-07T20:33:18.6515160Z contiguous: bool, 2025-05-07T20:33:18.6515414Z compiled: bool, 2025-05-07T20:33:18.6515643Z ) -> None: 2025-05-07T20:33:18.6515872Z torch.manual_seed(2025) 2025-05-07T20:33:18.6516125Z 2025-05-07T20:33:18.6516404Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6516767Z 2025-05-07T20:33:18.6516977Z x_sign = torch.sign(x) 2025-05-07T20:33:18.6517277Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.6519361Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6521298Z 2025-05-07T20:33:18.6521421Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:18.6521649Z 2025-05-07T20:33:18.6521757Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6522186Z self=, 2025-05-07T20:33:18.6522601Z T=2048, 2025-05-07T20:33:18.6522800Z D=7168, 2025-05-07T20:33:18.6523016Z scale_ub=None, 2025-05-07T20:33:18.6523273Z contiguous=True, 2025-05-07T20:33:18.6523556Z compiled=False, 2025-05-07T20:33:18.6523952Z ) 2025-05-07T20:33:18.9573447Z self = 2025-05-07T20:33:18.9574473Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:18.9574998Z 2025-05-07T20:33:18.9575159Z @given( 2025-05-07T20:33:18.9575435Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.9575776Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.9576109Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.9576469Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.9576833Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.9577143Z ) 2025-05-07T20:33:18.9577657Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.9578133Z def test_silu_mul_quant( 2025-05-07T20:33:18.9578400Z self, 2025-05-07T20:33:18.9578628Z T: int, 2025-05-07T20:33:18.9578841Z D: int, 2025-05-07T20:33:18.9579081Z scale_ub: Optional[float], 2025-05-07T20:33:18.9579378Z contiguous: bool, 2025-05-07T20:33:18.9579635Z compiled: bool, 2025-05-07T20:33:18.9579955Z ) -> None: 2025-05-07T20:33:18.9580257Z torch.manual_seed(2025) 2025-05-07T20:33:18.9580517Z 2025-05-07T20:33:18.9580816Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.9581189Z 2025-05-07T20:33:18.9581398Z > x_sign = torch.sign(x) 2025-05-07T20:33:18.9583545Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.9585543Z 2025-05-07T20:33:18.9585673Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:18.9585912Z 2025-05-07T20:33:18.9586027Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.9586474Z self=, 2025-05-07T20:33:18.9586904Z T=1, 2025-05-07T20:33:18.9587110Z D=7168, 2025-05-07T20:33:18.9587320Z scale_ub=1200.0, 2025-05-07T20:33:18.9587560Z contiguous=True, 2025-05-07T20:33:18.9587802Z compiled=False, 2025-05-07T20:33:18.9588024Z ) 2025-05-07T20:33:18.9588369Z self = 2025-05-07T20:33:18.9588894Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:18.9589185Z 2025-05-07T20:33:18.9589270Z @given( 2025-05-07T20:33:18.9589521Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.9589855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.9590185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.9590547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.9590902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.9591212Z ) 2025-05-07T20:33:18.9591592Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.9592067Z def test_silu_mul_quant( 2025-05-07T20:33:18.9592332Z self, 2025-05-07T20:33:18.9592545Z T: int, 2025-05-07T20:33:18.9592754Z D: int, 2025-05-07T20:33:18.9592998Z scale_ub: Optional[float], 2025-05-07T20:33:18.9593296Z contiguous: bool, 2025-05-07T20:33:18.9593630Z compiled: bool, 2025-05-07T20:33:18.9593871Z ) -> None: 2025-05-07T20:33:18.9594108Z torch.manual_seed(2025) 2025-05-07T20:33:18.9594368Z 2025-05-07T20:33:18.9594732Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.9595103Z 2025-05-07T20:33:18.9595316Z x_sign = torch.sign(x) 2025-05-07T20:33:18.9595626Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.9595969Z x = x_sign * x_clamp 2025-05-07T20:33:18.9596232Z x0 = x[:, :D] 2025-05-07T20:33:18.9596464Z x1 = x[:, D:] 2025-05-07T20:33:18.9596694Z 2025-05-07T20:33:18.9596897Z if contiguous: 2025-05-07T20:33:18.9597147Z x0 = x0.contiguous() 2025-05-07T20:33:18.9597428Z x1 = x1.contiguous() 2025-05-07T20:33:18.9597691Z 2025-05-07T20:33:18.9597895Z if scale_ub is not None: 2025-05-07T20:33:18.9598196Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.9598609Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.9598947Z ) 2025-05-07T20:33:18.9599155Z else: 2025-05-07T20:33:18.9599391Z scale_ub_tensor = None 2025-05-07T20:33:18.9599665Z 2025-05-07T20:33:18.9599914Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.9600249Z op = silu_mul_quant 2025-05-07T20:33:18.9600566Z if compiled: 2025-05-07T20:33:18.9600873Z op = torch.compile(op) 2025-05-07T20:33:18.9601197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.9601498Z 2025-05-07T20:33:18.9601704Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.9601888Z 2025-05-07T20:33:18.9601996Z moe/activation_test.py:117: 2025-05-07T20:33:18.9602318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.9602672Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.9602983Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.9603790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.9604540Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.9605117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.9605855Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.9606580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.9607160Z kernel = self.compile( 2025-05-07T20:33:18.9607743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.9608452Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.9608880Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.9609129Z 2025-05-07T20:33:18.9609354Z self = 2025-05-07T20:33:18.9610522Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.9612010Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d44c0>} 2025-05-07T20:33:18.9613457Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.9614559Z context = 2025-05-07T20:33:18.9614871Z 2025-05-07T20:33:18.9615055Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.9615624Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.9616178Z module_map=module_map) 2025-05-07T20:33:18.9616569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.9616952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.9617239Z E ^ 2025-05-07T20:33:18.9617744Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.9618227Z 2025-05-07T20:33:18.9618673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.9619225Z 2025-05-07T20:33:18.9619339Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.9619788Z self=, 2025-05-07T20:33:18.9620271Z T=128, 2025-05-07T20:33:18.9620472Z D=5120, 2025-05-07T20:33:18.9620685Z scale_ub=None, 2025-05-07T20:33:18.9620917Z contiguous=True, 2025-05-07T20:33:18.9621160Z compiled=False, 2025-05-07T20:33:18.9621384Z ) 2025-05-07T20:33:19.0430078Z self = 2025-05-07T20:33:19.0430823Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.0431386Z 2025-05-07T20:33:19.0431580Z @given( 2025-05-07T20:33:19.0431972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.0432457Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.0432918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.0433360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.0433810Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.0434118Z ) 2025-05-07T20:33:19.0434495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.0434977Z def test_silu_mul_quant( 2025-05-07T20:33:19.0435242Z self, 2025-05-07T20:33:19.0435452Z T: int, 2025-05-07T20:33:19.0435668Z D: int, 2025-05-07T20:33:19.0435905Z scale_ub: Optional[float], 2025-05-07T20:33:19.0436198Z contiguous: bool, 2025-05-07T20:33:19.0436458Z compiled: bool, 2025-05-07T20:33:19.0436703Z ) -> None: 2025-05-07T20:33:19.0436938Z torch.manual_seed(2025) 2025-05-07T20:33:19.0437200Z 2025-05-07T20:33:19.0437494Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.0437867Z 2025-05-07T20:33:19.0438076Z x_sign = torch.sign(x) 2025-05-07T20:33:19.0438394Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.0438733Z x = x_sign * x_clamp 2025-05-07T20:33:19.0438987Z x0 = x[:, :D] 2025-05-07T20:33:19.0439225Z x1 = x[:, D:] 2025-05-07T20:33:19.0439453Z 2025-05-07T20:33:19.0439655Z if contiguous: 2025-05-07T20:33:19.0439909Z x0 = x0.contiguous() 2025-05-07T20:33:19.0440189Z x1 = x1.contiguous() 2025-05-07T20:33:19.0440455Z 2025-05-07T20:33:19.0440666Z if scale_ub is not None: 2025-05-07T20:33:19.0440963Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.0441328Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.0441661Z ) 2025-05-07T20:33:19.0441872Z else: 2025-05-07T20:33:19.0442106Z scale_ub_tensor = None 2025-05-07T20:33:19.0442374Z 2025-05-07T20:33:19.0442625Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.0442968Z op = silu_mul_quant 2025-05-07T20:33:19.0443236Z if compiled: 2025-05-07T20:33:19.0443504Z op = torch.compile(op) 2025-05-07T20:33:19.0443835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0444134Z 2025-05-07T20:33:19.0444345Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.0444529Z 2025-05-07T20:33:19.0444635Z moe/activation_test.py:117: 2025-05-07T20:33:19.0445037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0445393Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.0445696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0446446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.0447194Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.0447769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.0448505Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.0455516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.0456151Z kernel = self.compile( 2025-05-07T20:33:19.0456851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.0457570Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.0458003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0458252Z 2025-05-07T20:33:19.0458483Z self = 2025-05-07T20:33:19.0459731Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.0461217Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d4940>} 2025-05-07T20:33:19.0462661Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.0463775Z context = 2025-05-07T20:33:19.0464088Z 2025-05-07T20:33:19.0464272Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.0464835Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.0465343Z module_map=module_map) 2025-05-07T20:33:19.0465743Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.0466128Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.0466417Z E ^ 2025-05-07T20:33:19.0466923Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.0467410Z 2025-05-07T20:33:19.0467863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.0468413Z 2025-05-07T20:33:19.0468530Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.0468976Z self=, 2025-05-07T20:33:19.0469412Z T=128, 2025-05-07T20:33:19.0469613Z D=7168, 2025-05-07T20:33:19.0469827Z scale_ub=None, 2025-05-07T20:33:19.0470061Z contiguous=True, 2025-05-07T20:33:19.0470303Z compiled=False, 2025-05-07T20:33:19.0470529Z ) 2025-05-07T20:33:19.0470878Z self = 2025-05-07T20:33:19.0471408Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.0471698Z 2025-05-07T20:33:19.0471784Z @given( 2025-05-07T20:33:19.0472036Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.0472378Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.0472710Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.0473069Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.0473480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.0473929Z ) 2025-05-07T20:33:19.0474308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.0474784Z def test_silu_mul_quant( 2025-05-07T20:33:19.0475055Z self, 2025-05-07T20:33:19.0475269Z T: int, 2025-05-07T20:33:19.0475485Z D: int, 2025-05-07T20:33:19.0475728Z scale_ub: Optional[float], 2025-05-07T20:33:19.0476019Z contiguous: bool, 2025-05-07T20:33:19.0476280Z compiled: bool, 2025-05-07T20:33:19.0476523Z ) -> None: 2025-05-07T20:33:19.0476756Z torch.manual_seed(2025) 2025-05-07T20:33:19.0477016Z 2025-05-07T20:33:19.0477310Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.0477730Z 2025-05-07T20:33:19.0477945Z x_sign = torch.sign(x) 2025-05-07T20:33:19.0478266Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.0478600Z x = x_sign * x_clamp 2025-05-07T20:33:19.0478865Z x0 = x[:, :D] 2025-05-07T20:33:19.0479101Z x1 = x[:, D:] 2025-05-07T20:33:19.0479326Z 2025-05-07T20:33:19.0479528Z if contiguous: 2025-05-07T20:33:19.0479783Z x0 = x0.contiguous() 2025-05-07T20:33:19.0480113Z x1 = x1.contiguous() 2025-05-07T20:33:19.0480422Z 2025-05-07T20:33:19.0480638Z if scale_ub is not None: 2025-05-07T20:33:19.0480940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.0481298Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.0481639Z ) 2025-05-07T20:33:19.0481851Z else: 2025-05-07T20:33:19.0482078Z scale_ub_tensor = None 2025-05-07T20:33:19.0482351Z 2025-05-07T20:33:19.0482607Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.0482949Z op = silu_mul_quant 2025-05-07T20:33:19.0483224Z if compiled: 2025-05-07T20:33:19.0483521Z op = torch.compile(op) 2025-05-07T20:33:19.0483865Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0484167Z 2025-05-07T20:33:19.0484376Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.0484556Z 2025-05-07T20:33:19.0484667Z moe/activation_test.py:117: 2025-05-07T20:33:19.0484990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0485350Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.0485657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0486397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.0487147Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.0487725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.0488464Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.0489199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.0489777Z kernel = self.compile( 2025-05-07T20:33:19.0490363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.0491069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.0491498Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0491750Z 2025-05-07T20:33:19.0491976Z self = 2025-05-07T20:33:19.0493140Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.0494679Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d5240>} 2025-05-07T20:33:19.0496124Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.0497227Z context = 2025-05-07T20:33:19.0497540Z 2025-05-07T20:33:19.0497723Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.0498283Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.0498782Z module_map=module_map) 2025-05-07T20:33:19.0499175Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.0499602Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.0499879Z E ^ 2025-05-07T20:33:19.0500387Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.0500871Z 2025-05-07T20:33:19.0501324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.0502032Z 2025-05-07T20:33:19.0502192Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.0502639Z self=, 2025-05-07T20:33:19.0503073Z T=2048, 2025-05-07T20:33:19.0503282Z D=7168, 2025-05-07T20:33:19.0503523Z scale_ub=1200.0, 2025-05-07T20:33:19.0503778Z contiguous=True, 2025-05-07T20:33:19.0504021Z compiled=False, 2025-05-07T20:33:19.0504240Z ) 2025-05-07T20:33:19.1485373Z self = 2025-05-07T20:33:19.1486214Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.1486638Z 2025-05-07T20:33:19.1486736Z @given( 2025-05-07T20:33:19.1487001Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1487336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.1487666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.1488027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.1488382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.1488696Z ) 2025-05-07T20:33:19.1489078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.1489551Z def test_silu_mul_quant( 2025-05-07T20:33:19.1489815Z self, 2025-05-07T20:33:19.1490029Z T: int, 2025-05-07T20:33:19.1490238Z D: int, 2025-05-07T20:33:19.1490478Z scale_ub: Optional[float], 2025-05-07T20:33:19.1490771Z contiguous: bool, 2025-05-07T20:33:19.1491034Z compiled: bool, 2025-05-07T20:33:19.1491272Z ) -> None: 2025-05-07T20:33:19.1491507Z torch.manual_seed(2025) 2025-05-07T20:33:19.1491775Z 2025-05-07T20:33:19.1492063Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1494297Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1496333Z 2025-05-07T20:33:19.1496462Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.1496697Z 2025-05-07T20:33:19.1496808Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.1497253Z self=, 2025-05-07T20:33:19.1497839Z T=1, 2025-05-07T20:33:19.1498044Z D=5120, 2025-05-07T20:33:19.1498258Z scale_ub=1200.0, 2025-05-07T20:33:19.1498496Z contiguous=True, 2025-05-07T20:33:19.1498738Z compiled=False, 2025-05-07T20:33:19.1498963Z ) 2025-05-07T20:33:19.1499307Z self = 2025-05-07T20:33:19.1499820Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.1500108Z 2025-05-07T20:33:19.1500195Z @given( 2025-05-07T20:33:19.1500442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1500773Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.1501098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.1501454Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.1501869Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.1502177Z ) 2025-05-07T20:33:19.1502555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.1503029Z def test_silu_mul_quant( 2025-05-07T20:33:19.1503290Z self, 2025-05-07T20:33:19.1503503Z T: int, 2025-05-07T20:33:19.1503789Z D: int, 2025-05-07T20:33:19.1504021Z scale_ub: Optional[float], 2025-05-07T20:33:19.1504369Z contiguous: bool, 2025-05-07T20:33:19.1504632Z compiled: bool, 2025-05-07T20:33:19.1504869Z ) -> None: 2025-05-07T20:33:19.1505103Z torch.manual_seed(2025) 2025-05-07T20:33:19.1505363Z 2025-05-07T20:33:19.1505651Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1506019Z 2025-05-07T20:33:19.1506228Z x_sign = torch.sign(x) 2025-05-07T20:33:19.1506534Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.1506868Z x = x_sign * x_clamp 2025-05-07T20:33:19.1507123Z x0 = x[:, :D] 2025-05-07T20:33:19.1507351Z x1 = x[:, D:] 2025-05-07T20:33:19.1507580Z 2025-05-07T20:33:19.1507786Z if contiguous: 2025-05-07T20:33:19.1508030Z x0 = x0.contiguous() 2025-05-07T20:33:19.1508306Z x1 = x1.contiguous() 2025-05-07T20:33:19.1508565Z 2025-05-07T20:33:19.1508778Z if scale_ub is not None: 2025-05-07T20:33:19.1509076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.1509432Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.1509759Z ) 2025-05-07T20:33:19.1509963Z else: 2025-05-07T20:33:19.1510183Z scale_ub_tensor = None 2025-05-07T20:33:19.1510451Z 2025-05-07T20:33:19.1510699Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.1511033Z op = silu_mul_quant 2025-05-07T20:33:19.1511305Z if compiled: 2025-05-07T20:33:19.1511586Z op = torch.compile(op) 2025-05-07T20:33:19.1511905Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.1512200Z 2025-05-07T20:33:19.1512404Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.1512584Z 2025-05-07T20:33:19.1512690Z moe/activation_test.py:117: 2025-05-07T20:33:19.1513007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.1513365Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.1513813Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.1514547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.1515280Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.1515844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.1516568Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.1517274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.1517884Z kernel = self.compile( 2025-05-07T20:33:19.1518464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.1519161Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.1519582Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.1519824Z 2025-05-07T20:33:19.1520045Z self = 2025-05-07T20:33:19.1521188Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.1522684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d6200>} 2025-05-07T20:33:19.1524347Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.1525502Z context = 2025-05-07T20:33:19.1525904Z 2025-05-07T20:33:19.1526082Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.1526631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.1527129Z module_map=module_map) 2025-05-07T20:33:19.1527511Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.1527884Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.1528160Z E ^ 2025-05-07T20:33:19.1528646Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.1529126Z 2025-05-07T20:33:19.1529566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.1530112Z 2025-05-07T20:33:19.1530220Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.1530664Z self=, 2025-05-07T20:33:19.1531082Z T=2048, 2025-05-07T20:33:19.1531281Z D=5120, 2025-05-07T20:33:19.1531486Z scale_ub=None, 2025-05-07T20:33:19.1531708Z contiguous=True, 2025-05-07T20:33:19.1531945Z compiled=False, 2025-05-07T20:33:19.1532163Z ) 2025-05-07T20:33:19.1532496Z self = 2025-05-07T20:33:19.1533018Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.1533309Z 2025-05-07T20:33:19.1533391Z @given( 2025-05-07T20:33:19.1533634Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1533963Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.1534291Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.1534642Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.1534985Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.1535297Z ) 2025-05-07T20:33:19.1535677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.1536144Z def test_silu_mul_quant( 2025-05-07T20:33:19.1536394Z self, 2025-05-07T20:33:19.1536603Z T: int, 2025-05-07T20:33:19.1536809Z D: int, 2025-05-07T20:33:19.1537035Z scale_ub: Optional[float], 2025-05-07T20:33:19.1537320Z contiguous: bool, 2025-05-07T20:33:19.1537576Z compiled: bool, 2025-05-07T20:33:19.1537807Z ) -> None: 2025-05-07T20:33:19.1538039Z torch.manual_seed(2025) 2025-05-07T20:33:19.1538294Z 2025-05-07T20:33:19.1538576Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1539008Z 2025-05-07T20:33:19.1539214Z > x_sign = torch.sign(x) 2025-05-07T20:33:19.1541266Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1543217Z 2025-05-07T20:33:19.1543342Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:19.1543570Z 2025-05-07T20:33:19.1543743Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.1544175Z self=, 2025-05-07T20:33:19.1544596Z T=16384, 2025-05-07T20:33:19.1544800Z D=5120, 2025-05-07T20:33:19.1545003Z scale_ub=None, 2025-05-07T20:33:19.1545228Z contiguous=True, 2025-05-07T20:33:19.1545457Z compiled=False, 2025-05-07T20:33:19.1545676Z ) 2025-05-07T20:33:19.2535362Z self = 2025-05-07T20:33:19.2536965Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2537762Z 2025-05-07T20:33:19.2537934Z @given( 2025-05-07T20:33:19.2538403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2539049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2539681Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2540363Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2541051Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2541639Z ) 2025-05-07T20:33:19.2542366Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2543264Z def test_silu_mul_quant( 2025-05-07T20:33:19.2543563Z self, 2025-05-07T20:33:19.2543791Z T: int, 2025-05-07T20:33:19.2543996Z D: int, 2025-05-07T20:33:19.2544229Z scale_ub: Optional[float], 2025-05-07T20:33:19.2544515Z contiguous: bool, 2025-05-07T20:33:19.2544762Z compiled: bool, 2025-05-07T20:33:19.2544999Z ) -> None: 2025-05-07T20:33:19.2545224Z torch.manual_seed(2025) 2025-05-07T20:33:19.2545471Z 2025-05-07T20:33:19.2545756Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2547877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2549821Z 2025-05-07T20:33:19.2549944Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2550168Z 2025-05-07T20:33:19.2550284Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2550710Z self=, 2025-05-07T20:33:19.2551124Z T=4096, 2025-05-07T20:33:19.2551323Z D=5120, 2025-05-07T20:33:19.2551519Z scale_ub=None, 2025-05-07T20:33:19.2551738Z contiguous=True, 2025-05-07T20:33:19.2551972Z compiled=False, 2025-05-07T20:33:19.2552181Z ) 2025-05-07T20:33:19.2552518Z self = 2025-05-07T20:33:19.2553034Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2553314Z 2025-05-07T20:33:19.2553628Z @given( 2025-05-07T20:33:19.2553869Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2554194Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2554514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2554858Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2555199Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2555497Z ) 2025-05-07T20:33:19.2555860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2556316Z def test_silu_mul_quant( 2025-05-07T20:33:19.2556569Z self, 2025-05-07T20:33:19.2556769Z T: int, 2025-05-07T20:33:19.2556974Z D: int, 2025-05-07T20:33:19.2557198Z scale_ub: Optional[float], 2025-05-07T20:33:19.2557549Z contiguous: bool, 2025-05-07T20:33:19.2557795Z compiled: bool, 2025-05-07T20:33:19.2558027Z ) -> None: 2025-05-07T20:33:19.2558253Z torch.manual_seed(2025) 2025-05-07T20:33:19.2558499Z 2025-05-07T20:33:19.2558779Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2560947Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2562932Z 2025-05-07T20:33:19.2563061Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2563291Z 2025-05-07T20:33:19.2563414Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2563873Z self=, 2025-05-07T20:33:19.2564292Z T=2048, 2025-05-07T20:33:19.2564486Z D=5120, 2025-05-07T20:33:19.2564683Z scale_ub=None, 2025-05-07T20:33:19.2564907Z contiguous=False, 2025-05-07T20:33:19.2565146Z compiled=False, 2025-05-07T20:33:19.2565354Z ) 2025-05-07T20:33:19.2565685Z self = 2025-05-07T20:33:19.2566201Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.2566486Z 2025-05-07T20:33:19.2566566Z @given( 2025-05-07T20:33:19.2566804Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2567130Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2567446Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2567794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2568140Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2568440Z ) 2025-05-07T20:33:19.2568801Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2569265Z def test_silu_mul_quant( 2025-05-07T20:33:19.2569518Z self, 2025-05-07T20:33:19.2569716Z T: int, 2025-05-07T20:33:19.2569926Z D: int, 2025-05-07T20:33:19.2570156Z scale_ub: Optional[float], 2025-05-07T20:33:19.2570433Z contiguous: bool, 2025-05-07T20:33:19.2570681Z compiled: bool, 2025-05-07T20:33:19.2570911Z ) -> None: 2025-05-07T20:33:19.2571134Z torch.manual_seed(2025) 2025-05-07T20:33:19.2571388Z 2025-05-07T20:33:19.2571667Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2573931Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2575880Z 2025-05-07T20:33:19.2576010Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2576232Z 2025-05-07T20:33:19.2576341Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2576773Z self=, 2025-05-07T20:33:19.2577190Z T=4096, 2025-05-07T20:33:19.2577382Z D=7168, 2025-05-07T20:33:19.2577580Z scale_ub=None, 2025-05-07T20:33:19.2577806Z contiguous=True, 2025-05-07T20:33:19.2578035Z compiled=True, 2025-05-07T20:33:19.2578295Z ) 2025-05-07T20:33:19.2578626Z self = 2025-05-07T20:33:19.2579134Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.2579418Z 2025-05-07T20:33:19.2579498Z @given( 2025-05-07T20:33:19.2579736Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2580063Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2580458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2580804Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2581151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2581446Z ) 2025-05-07T20:33:19.2581811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2582273Z def test_silu_mul_quant( 2025-05-07T20:33:19.2582522Z self, 2025-05-07T20:33:19.2582726Z T: int, 2025-05-07T20:33:19.2582935Z D: int, 2025-05-07T20:33:19.2583159Z scale_ub: Optional[float], 2025-05-07T20:33:19.2583447Z contiguous: bool, 2025-05-07T20:33:19.2583724Z compiled: bool, 2025-05-07T20:33:19.2583956Z ) -> None: 2025-05-07T20:33:19.2584178Z torch.manual_seed(2025) 2025-05-07T20:33:19.2584432Z 2025-05-07T20:33:19.2584715Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2586842Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2588777Z 2025-05-07T20:33:19.2588902Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2589122Z 2025-05-07T20:33:19.2589228Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2589658Z self=, 2025-05-07T20:33:19.2590075Z T=2048, 2025-05-07T20:33:19.2590265Z D=5120, 2025-05-07T20:33:19.2590459Z scale_ub=1200.0, 2025-05-07T20:33:19.2590693Z contiguous=False, 2025-05-07T20:33:19.2590929Z compiled=False, 2025-05-07T20:33:19.2591138Z ) 2025-05-07T20:33:19.2598571Z self = 2025-05-07T20:33:19.2599109Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.2599402Z 2025-05-07T20:33:19.2599484Z @given( 2025-05-07T20:33:19.2599729Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2600056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2600380Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2600725Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2601141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2601439Z ) 2025-05-07T20:33:19.2601806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2602270Z def test_silu_mul_quant( 2025-05-07T20:33:19.2602523Z self, 2025-05-07T20:33:19.2602726Z T: int, 2025-05-07T20:33:19.2602934Z D: int, 2025-05-07T20:33:19.2603165Z scale_ub: Optional[float], 2025-05-07T20:33:19.2603492Z contiguous: bool, 2025-05-07T20:33:19.2603740Z compiled: bool, 2025-05-07T20:33:19.2603974Z ) -> None: 2025-05-07T20:33:19.2604194Z torch.manual_seed(2025) 2025-05-07T20:33:19.2604446Z 2025-05-07T20:33:19.2604730Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2606933Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2608959Z 2025-05-07T20:33:19.2609085Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2609309Z 2025-05-07T20:33:19.2609416Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2609846Z self=, 2025-05-07T20:33:19.2610261Z T=4096, 2025-05-07T20:33:19.2610449Z D=7168, 2025-05-07T20:33:19.2610656Z scale_ub=1200.0, 2025-05-07T20:33:19.2610890Z contiguous=True, 2025-05-07T20:33:19.2611117Z compiled=False, 2025-05-07T20:33:19.2611328Z ) 2025-05-07T20:33:19.3889445Z self = 2025-05-07T20:33:19.3890248Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.3890647Z 2025-05-07T20:33:19.3890765Z @given( 2025-05-07T20:33:19.3891079Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3891416Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3891740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3892090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3892436Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3892741Z ) 2025-05-07T20:33:19.3893115Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3893629Z def test_silu_mul_quant( 2025-05-07T20:33:19.3893889Z self, 2025-05-07T20:33:19.3894097Z T: int, 2025-05-07T20:33:19.3894304Z D: int, 2025-05-07T20:33:19.3894537Z scale_ub: Optional[float], 2025-05-07T20:33:19.3894829Z contiguous: bool, 2025-05-07T20:33:19.3895081Z compiled: bool, 2025-05-07T20:33:19.3895322Z ) -> None: 2025-05-07T20:33:19.3895552Z torch.manual_seed(2025) 2025-05-07T20:33:19.3895809Z 2025-05-07T20:33:19.3896095Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3898243Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3900186Z 2025-05-07T20:33:19.3900315Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3900652Z 2025-05-07T20:33:19.3900770Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3901199Z self=, 2025-05-07T20:33:19.3901622Z T=16384, 2025-05-07T20:33:19.3901832Z D=7168, 2025-05-07T20:33:19.3902048Z scale_ub=None, 2025-05-07T20:33:19.3902277Z contiguous=False, 2025-05-07T20:33:19.3902519Z compiled=True, 2025-05-07T20:33:19.3902736Z ) 2025-05-07T20:33:19.3903072Z self = 2025-05-07T20:33:19.3903588Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.3903879Z 2025-05-07T20:33:19.3903968Z @given( 2025-05-07T20:33:19.3904207Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3904610Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3904934Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3905282Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3905631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3905936Z ) 2025-05-07T20:33:19.3906302Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3906830Z def test_silu_mul_quant( 2025-05-07T20:33:19.3907172Z self, 2025-05-07T20:33:19.3907382Z T: int, 2025-05-07T20:33:19.3907592Z D: int, 2025-05-07T20:33:19.3907824Z scale_ub: Optional[float], 2025-05-07T20:33:19.3908112Z contiguous: bool, 2025-05-07T20:33:19.3908362Z compiled: bool, 2025-05-07T20:33:19.3908600Z ) -> None: 2025-05-07T20:33:19.3908830Z torch.manual_seed(2025) 2025-05-07T20:33:19.3909084Z 2025-05-07T20:33:19.3909371Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3911503Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = <ActivationTests testMethod=test_silu_mul_quant>
T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source identical to the listing above -- elided]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source identical to the listing above -- elided]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. [remainder of the message identical to the previous failure]
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
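The failed request sizes line up exactly with the test's input tensor: x is [T, 2*D] in bfloat16, i.e. T * 2D * 2 bytes, so T=4096 with D=7168 gives 4096 * 14336 * 2 B = 112 MiB, and T=16384 gives 448 MiB. The error text itself points at expandable segments; a minimal sketch of opting in, assuming the variable is set before CUDA is first initialized (whether it rescues this particular run is untested):

    import os

    # Opt in to expandable segments, as the OOM message above suggests.
    # Must happen before CUDA is initialized, hence before importing torch.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # the caching allocator reads the env var on first CUDA use

    x = torch.randn([4096, 2 * 7168], device="cuda", dtype=torch.bfloat16)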
Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <ActivationTests testMethod=test_silu_mul_quant>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

[test source identical to the listing above -- elided; this example allocates successfully and reaches the kernel launch]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test source identical to the listing above -- elided]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
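The ValueError above is architectural, not transient: fp8e4nv is Triton's name for the e4m3 float8 variant, and on NVIDIA hardware those kernels generally require compute capability 8.9 or newer (Ada/Hopper), while the A10G in a g5.4xlarge runner is sm_86. A hedged sketch of gating such tests on capability (the helper name and threshold are our illustration, not FBGEMM's API):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8 e4m3) Triton kernels need sm_89+; sm_86 (A10G) only
        # offers 'fp8e4b15' and 'fp8e5', matching the ValueError in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # e.g.:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(...): ...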
Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[test source identical to the listing above -- elided]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[Triton jit.py/compiler.py frames identical to the traceback shown earlier -- elided]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source identical to the listing above -- elided]

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
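The example above runs out of memory not on the initial allocation but at torch.clamp(torch.abs(x), ...): each of those ops materializes another [T, 2*D] temporary on a GPU that is already nearly full. A hedged sketch of the same arithmetic with in-place variants (our rewrite, safe only because nothing else aliases x at this point in the test):

    import torch

    T, D = 128, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

    x_sign = torch.sign(x)  # one extra tensor we still need
    # abs_/clamp_/mul_ mutate x in place instead of allocating new temporaries:
    x = x.abs_().clamp_(0.01, 2.0).mul_(x_sign)  # == sign(x) * clamp(abs(x), 0.01, 2.0)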
Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[test source identical to the listing above -- elided]

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[test source identical to the listing above -- elided]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
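Free memory shrinks steadily across examples (26.44 MiB free early in the run, 4.44 MiB by this point) even though each example's tensors go out of scope, which suggests references and cached allocator blocks surviving between Hypothesis examples. A hedged sketch of per-example cleanup, as standard PyTorch hygiene rather than anything the FBGEMM suite currently does:

    import gc
    import torch

    def release_cuda_memory() -> None:
        torch.cuda.synchronize()  # let pending kernels finish first
        gc.collect()              # drop Python references to dead tensors
        torch.cuda.empty_cache()  # return cached blocks to the driver

    # e.g. call release_cuda_memory() from the test's tearDown so one
    # example's leftovers cannot starve the next example's allocations.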
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
  |     method()
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
    | See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=<ActivationTests testMethod=test_silu_mul_quant>,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=<ActivationTests testMethod=test_silu_mul_quant>,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. [message otherwise identical to sub-exception 2]
    | See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=<ActivationTests testMethod=test_silu_mul_quant>,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   [Triton jit/autotuner/do_bench/compile frames identical to the tracebacks shown earlier -- elided]
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=<ActivationTests testMethod=test_silu_mul_quant>,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
2025-05-07T20:33:19.8258464Z x1 = x1.contiguous() 2025-05-07T20:33:19.8258737Z 2025-05-07T20:33:19.8258959Z if scale_ub is not None: 2025-05-07T20:33:19.8259264Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8259641Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8259992Z ) 2025-05-07T20:33:19.8260210Z else: 2025-05-07T20:33:19.8260452Z scale_ub_tensor = None 2025-05-07T20:33:19.8260738Z 2025-05-07T20:33:19.8260996Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8261353Z op = silu_mul_quant 2025-05-07T20:33:19.8261638Z if compiled: 2025-05-07T20:33:19.8261912Z op = torch.compile(op) 2025-05-07T20:33:19.8262311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8262626Z 2025-05-07T20:33:19.8262842Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.8263169Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.8263506Z 2025-05-07T20:33:19.8263772Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8264150Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.8264479Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.8264832Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.8265229Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8265581Z 2025-05-07T20:33:19.8265810Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.8266026Z 2025-05-07T20:33:19.8266190Z moe/activation_test.py:126: 2025-05-07T20:33:19.8266522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8266903Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.8267277Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8268151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.8269078Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.8269687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8270438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8271204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.8272010Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8272846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:19.8273781Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8274587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.8275304Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.8275975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.8276544Z fn() 2025-05-07T20:33:19.8277106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.8277750Z self.fn.run( 2025-05-07T20:33:19.8278265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8278858Z kernel = self.compile( 2025-05-07T20:33:19.8279462Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8280185Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8280623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8280893Z 2025-05-07T20:33:19.8281129Z self = 2025-05-07T20:33:19.8282352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8283893Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb7bb8af0>} 2025-05-07T20:33:19.8285431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8286570Z context = 2025-05-07T20:33:19.8286900Z 2025-05-07T20:33:19.8287088Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8302717Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8303256Z module_map=module_map) 2025-05-07T20:33:19.8303689Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8304114Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.8304405Z E ^ 2025-05-07T20:33:19.8304930Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8305517Z 2025-05-07T20:33:19.8305996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8306568Z 2025-05-07T20:33:19.8306684Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8307149Z self=, 2025-05-07T20:33:19.8307643Z T=2048, 2025-05-07T20:33:19.8307854Z D=5120, 2025-05-07T20:33:19.8308108Z scale_ub=1200.0, 2025-05-07T20:33:19.8308354Z contiguous=True, 2025-05-07T20:33:19.8308595Z compiled=False, 2025-05-07T20:33:19.8308820Z ) 2025-05-07T20:33:19.8309174Z self = 2025-05-07T20:33:19.8309725Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.8310030Z 2025-05-07T20:33:19.8310114Z @given( 2025-05-07T20:33:19.8310377Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8310732Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8311072Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8311439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8311800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8312112Z ) 2025-05-07T20:33:19.8312497Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8312992Z def test_silu_mul_quant( 2025-05-07T20:33:19.8313266Z self, 2025-05-07T20:33:19.8313483Z T: int, 2025-05-07T20:33:19.8313781Z D: int, 2025-05-07T20:33:19.8314024Z scale_ub: Optional[float], 2025-05-07T20:33:19.8314347Z contiguous: bool, 2025-05-07T20:33:19.8314647Z compiled: bool, 2025-05-07T20:33:19.8314923Z ) -> None: 2025-05-07T20:33:19.8315164Z torch.manual_seed(2025) 2025-05-07T20:33:19.8315437Z 2025-05-07T20:33:19.8315745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8316126Z 2025-05-07T20:33:19.8316341Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8316667Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8317018Z x = x_sign * x_clamp 2025-05-07T20:33:19.8317280Z x0 = x[:, :D] 
2025-05-07T20:33:19.8317527Z x1 = x[:, D:] 2025-05-07T20:33:19.8317756Z 2025-05-07T20:33:19.8317961Z if contiguous: 2025-05-07T20:33:19.8318227Z x0 = x0.contiguous() 2025-05-07T20:33:19.8318519Z x1 = x1.contiguous() 2025-05-07T20:33:19.8318789Z 2025-05-07T20:33:19.8319010Z if scale_ub is not None: 2025-05-07T20:33:19.8319324Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8319699Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8320055Z ) 2025-05-07T20:33:19.8320277Z else: 2025-05-07T20:33:19.8320514Z scale_ub_tensor = None 2025-05-07T20:33:19.8320805Z 2025-05-07T20:33:19.8321074Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8321423Z op = silu_mul_quant 2025-05-07T20:33:19.8321771Z if compiled: 2025-05-07T20:33:19.8322056Z op = torch.compile(op) 2025-05-07T20:33:19.8322388Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8322701Z 2025-05-07T20:33:19.8322925Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8323117Z 2025-05-07T20:33:19.8323238Z moe/activation_test.py:117: 2025-05-07T20:33:19.8323570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8324331Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8324673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8325441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8326212Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8326971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8327734Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8328468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8329063Z kernel = self.compile( 2025-05-07T20:33:19.8329812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8330544Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8330993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8331256Z 2025-05-07T20:33:19.8331488Z self = 2025-05-07T20:33:19.8332696Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8334238Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb7a95990>} 2025-05-07T20:33:19.8335730Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8336870Z context = 2025-05-07T20:33:19.8337191Z 2025-05-07T20:33:19.8337385Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8337969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8338489Z module_map=module_map) 2025-05-07T20:33:19.8338902Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8339298Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8339588Z E ^ 2025-05-07T20:33:19.8340115Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8340615Z 2025-05-07T20:33:19.8341088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8341660Z 2025-05-07T20:33:19.8341788Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8342248Z self=, 2025-05-07T20:33:19.8342697Z T=2048, 2025-05-07T20:33:19.8342915Z D=5120, 2025-05-07T20:33:19.8343133Z scale_ub=1200.0, 2025-05-07T20:33:19.8343390Z contiguous=True, 2025-05-07T20:33:19.8343641Z compiled=True, 2025-05-07T20:33:19.8343869Z ) 2025-05-07T20:33:19.8344234Z self = 2025-05-07T20:33:19.8344788Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.8345172Z 2025-05-07T20:33:19.8345270Z @given( 2025-05-07T20:33:19.8345534Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8345887Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8346239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8346610Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8346986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8347316Z ) 2025-05-07T20:33:19.8347712Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8348214Z def test_silu_mul_quant( 2025-05-07T20:33:19.8348496Z self, 2025-05-07T20:33:19.8348713Z T: int, 2025-05-07T20:33:19.8348945Z D: int, 2025-05-07T20:33:19.8349247Z scale_ub: Optional[float], 2025-05-07T20:33:19.8349554Z contiguous: bool, 2025-05-07T20:33:19.8349823Z compiled: bool, 2025-05-07T20:33:19.8350077Z ) -> None: 2025-05-07T20:33:19.8350329Z torch.manual_seed(2025) 2025-05-07T20:33:19.8350598Z 2025-05-07T20:33:19.8350906Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8351294Z 2025-05-07T20:33:19.8351567Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8351948Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8352304Z x = x_sign * x_clamp 2025-05-07T20:33:19.8352575Z x0 = x[:, :D] 2025-05-07T20:33:19.8352826Z x1 = x[:, D:] 2025-05-07T20:33:19.8353065Z 2025-05-07T20:33:19.8353272Z if contiguous: 2025-05-07T20:33:19.8353617Z x0 = x0.contiguous() 2025-05-07T20:33:19.8353914Z x1 = x1.contiguous() 2025-05-07T20:33:19.8354181Z 2025-05-07T20:33:19.8354402Z if scale_ub is not None: 2025-05-07T20:33:19.8354716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8355090Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8355441Z ) 2025-05-07T20:33:19.8355664Z else: 2025-05-07T20:33:19.8355905Z scale_ub_tensor = None 2025-05-07T20:33:19.8356186Z 2025-05-07T20:33:19.8356448Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8356807Z op = silu_mul_quant 2025-05-07T20:33:19.8357089Z if compiled: 2025-05-07T20:33:19.8357375Z op = torch.compile(op) 2025-05-07T20:33:19.8357711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8358017Z 2025-05-07T20:33:19.8358242Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.8358568Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.8358893Z 2025-05-07T20:33:19.8359167Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8359555Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.8359884Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.8360248Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.8360655Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8361008Z 2025-05-07T20:33:19.8361239Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.8361468Z 2025-05-07T20:33:19.8361588Z moe/activation_test.py:126: 2025-05-07T20:33:19.8361933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8362309Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.8362685Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8363570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.8364410Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.8365017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8365834Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8366606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.8367410Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8368258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:19.8369096Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8369909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.8370619Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.8371338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.8371916Z fn() 2025-05-07T20:33:19.8372487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.8373130Z self.fn.run( 2025-05-07T20:33:19.8373655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8374364Z kernel = self.compile( 2025-05-07T20:33:19.8375014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8375742Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8376188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8376445Z 2025-05-07T20:33:19.8376684Z self = 2025-05-07T20:33:19.8377888Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8379414Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efdb65256c0>} 2025-05-07T20:33:19.8380912Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8382054Z context = 2025-05-07T20:33:19.8382375Z 2025-05-07T20:33:19.8382571Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8383148Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8383677Z module_map=module_map) 2025-05-07T20:33:19.8384115Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8384535Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.8384842Z E ^ 2025-05-07T20:33:19.8385363Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8385868Z 2025-05-07T20:33:19.8386340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8386909Z 2025-05-07T20:33:19.8387028Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8387494Z self=, 2025-05-07T20:33:19.8387952Z T=16384, 2025-05-07T20:33:19.8388171Z D=7168, 2025-05-07T20:33:19.8388394Z scale_ub=1200.0, 2025-05-07T20:33:19.8388649Z contiguous=False, 2025-05-07T20:33:19.8388899Z compiled=False, 2025-05-07T20:33:19.8389134Z ) 2025-05-07T20:33:19.8389553Z self = 2025-05-07T20:33:19.8390119Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.8390433Z 2025-05-07T20:33:19.8390523Z @given( 2025-05-07T20:33:19.8390789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8391153Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8391498Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8391870Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8392248Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8392567Z ) 2025-05-07T20:33:19.8392968Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8393463Z def test_silu_mul_quant( 2025-05-07T20:33:19.8393845Z self, 2025-05-07T20:33:19.8394074Z T: int, 2025-05-07T20:33:19.8394301Z D: int, 2025-05-07T20:33:19.8394550Z scale_ub: Optional[float], 2025-05-07T20:33:19.8394857Z contiguous: bool, 2025-05-07T20:33:19.8395129Z compiled: bool, 2025-05-07T20:33:19.8395384Z ) -> None: 2025-05-07T20:33:19.8395624Z torch.manual_seed(2025) 2025-05-07T20:33:19.8395898Z 2025-05-07T20:33:19.8396258Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8396684Z 2025-05-07T20:33:19.8396915Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8397250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8397597Z x = x_sign * x_clamp 2025-05-07T20:33:19.8397873Z x0 = x[:, :D] 2025-05-07T20:33:19.8398121Z x1 = x[:, D:] 2025-05-07T20:33:19.8398354Z 2025-05-07T20:33:19.8398570Z if contiguous: 2025-05-07T20:33:19.8398838Z x0 = x0.contiguous() 2025-05-07T20:33:19.8399130Z x1 = x1.contiguous() 2025-05-07T20:33:19.8399402Z 2025-05-07T20:33:19.8399620Z if scale_ub is not None: 2025-05-07T20:33:19.8399932Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8400313Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8400663Z ) 2025-05-07T20:33:19.8400889Z else: 2025-05-07T20:33:19.8401128Z scale_ub_tensor = None 2025-05-07T20:33:19.8401422Z 2025-05-07T20:33:19.8401692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8402042Z op = silu_mul_quant 2025-05-07T20:33:19.8402327Z if compiled: 
2025-05-07T20:33:19.8402612Z op = torch.compile(op) 2025-05-07T20:33:19.8402944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8403256Z 2025-05-07T20:33:19.8403483Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8403671Z 2025-05-07T20:33:19.8403784Z moe/activation_test.py:117: 2025-05-07T20:33:19.8404125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8404504Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8404832Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8405603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8406379Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8406988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8407749Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8408493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8409093Z kernel = self.compile( 2025-05-07T20:33:19.8409705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8410435Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8410940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8411199Z 2025-05-07T20:33:19.8411442Z self = 2025-05-07T20:33:19.8412648Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8414229Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb65248b0>} 2025-05-07T20:33:19.8415729Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8416938Z context = 2025-05-07T20:33:19.8417264Z 2025-05-07T20:33:19.8417461Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8418050Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8418624Z module_map=module_map) 2025-05-07T20:33:19.8419077Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8419478Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8419772Z E ^ 2025-05-07T20:33:19.8420295Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:19.8421964Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
    [test body identical to the first listing above]
    Here y_fp8, y_scale = fn() succeeded; the eager reference path failed instead:

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid] -> triton autotuner do_bench -> compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
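The compiled=True examples fail in the eager reference path rather than in fn(), and that path bottoms out in triton_quantize_fp8_row, so the failure reproduces without Hypothesis or the MoE op at all. A hedged standalone reproducer sketch (assumes the same fbgemm_gpu install and an SM < 8.9 GPU, as on this runner):

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    # Any 2-D float32 CUDA tensor will do; the shape mirrors one failing example.
    y = torch.randn(128, 7168, device="cuda", dtype=torch.float32)
    # On a pre-SM-8.9 GPU this raises triton.compiler.errors.CompilationError
    # wrapping the ValueError quoted in the log above.
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)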
2025-05-07T20:33:19.8467306Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    [test body identical to the first listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8508082Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    [test body identical to the first listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8542999Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    [test body identical to the first listing above; fn() succeeded]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid] -> triton autotuner do_bench -> compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
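For orientation, the operation under test quantizes row-wise to fp8: each row gets one scale derived from its max magnitude (optionally clamped by scale_ub), and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A plain-PyTorch approximation of that math, offered as a hedged sketch only; the fbgemm_gpu kernel's exact semantics (epsilon handling, how the upper bound is applied) may differ:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)  # avoid div-by-zero rows
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / fp8_max  # one scale per row
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize: y_fp8.to(torch.float32) * scale[:, None]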
2025-05-07T20:33:19.8587500Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    [test body identical to the first listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8621714Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    [test body identical to the first listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8675089Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [test body identical to the first listing above; fn() succeeded]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid] -> triton autotuner do_bench -> compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
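The error message itself says Triton can still emit 'fp8e4b15' and 'fp8e5' on this architecture, so an e5m2 fallback is at least representable here. A hedged design sketch of a capability-based dtype choice in Triton terms; this is not fbgemm_gpu's current behavior, and e5m2's smaller mantissa would force looser test tolerances:

    import torch
    import triton.language as tl

    def pick_triton_fp8_dtype() -> tl.dtype:
        # fp8e4nv (e4m3) is only lowered on SM 8.9 (Ada) / SM 9.0 (Hopper) and newer.
        if torch.cuda.get_device_capability() >= (8, 9):
            return tl.float8e4nv
        # e5m2 is in this architecture's supported list quoted in the error above.
        return tl.float8e5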
2025-05-07T20:33:19.8719381Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [test body identical to the first listing above; fn() succeeded]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid] -> triton autotuner do_bench -> compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
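Because Hypothesis is just re-drawing parameters around one deterministic failure, any single parameter set from this log reproduces it. A hedged sketch of pinning one with hypothesis.example so it always runs first (the pinned values are the T=2048 example above; the test body is elided):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    @settings(deadline=None)
    def test_silu_mul_quant_pinned(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # same body as test_silu_mul_quant in the listing above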
2025-05-07T20:33:19.8738893Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [test body identical to the first listing above; fn() succeeded]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid] -> triton autotuner do_bench -> compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8754852Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd91070700>} 2025-05-07T20:33:19.8755684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8755906Z context = 2025-05-07T20:33:19.8755912Z 2025-05-07T20:33:19.8756104Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8756402Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8756525Z module_map=module_map) 2025-05-07T20:33:19.8756721Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8756845Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.8756934Z E ^ 2025-05-07T20:33:19.8757337Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8757342Z 2025-05-07T20:33:19.8757802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8757807Z 2025-05-07T20:33:19.8757934Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8758187Z self=, 2025-05-07T20:33:19.8758276Z T=4096, 2025-05-07T20:33:19.8758370Z D=5120, 2025-05-07T20:33:19.8758515Z scale_ub=None, 2025-05-07T20:33:19.8758620Z contiguous=True, 2025-05-07T20:33:19.8758718Z compiled=True, 2025-05-07T20:33:19.8758803Z ) 2025-05-07T20:33:19.8759058Z self = 2025-05-07T20:33:19.8759256Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.8759261Z 2025-05-07T20:33:19.8759348Z @given( 2025-05-07T20:33:19.8759491Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8759606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8759740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8759885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8760017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8760185Z ) 2025-05-07T20:33:19.8760463Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8760575Z def test_silu_mul_quant( 2025-05-07T20:33:19.8760669Z self, 2025-05-07T20:33:19.8760762Z T: int, 2025-05-07T20:33:19.8760855Z D: int, 2025-05-07T20:33:19.8760972Z scale_ub: Optional[float], 2025-05-07T20:33:19.8761124Z contiguous: bool, 2025-05-07T20:33:19.8761224Z compiled: bool, 2025-05-07T20:33:19.8761363Z ) -> None: 2025-05-07T20:33:19.8761472Z torch.manual_seed(2025) 2025-05-07T20:33:19.8761559Z 2025-05-07T20:33:19.8761757Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8761842Z 2025-05-07T20:33:19.8761949Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8762099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8762201Z x = x_sign * x_clamp 2025-05-07T20:33:19.8762304Z x0 = x[:, :D] 2025-05-07T20:33:19.8762395Z x1 = x[:, D:] 2025-05-07T20:33:19.8762479Z 2025-05-07T20:33:19.8762582Z if contiguous: 2025-05-07T20:33:19.8762689Z x0 = x0.contiguous() 2025-05-07T20:33:19.8762792Z x1 = x1.contiguous() 2025-05-07T20:33:19.8762882Z 2025-05-07T20:33:19.8762988Z if scale_ub is not None: 2025-05-07T20:33:19.8763109Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8763275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8763361Z ) 2025-05-07T20:33:19.8763448Z else: 2025-05-07T20:33:19.8763564Z scale_ub_tensor 
= None 2025-05-07T20:33:19.8763647Z 2025-05-07T20:33:19.8763796Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8763917Z op = silu_mul_quant 2025-05-07T20:33:19.8764022Z if compiled: 2025-05-07T20:33:19.8764136Z op = torch.compile(op) 2025-05-07T20:33:19.8764260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8764350Z 2025-05-07T20:33:19.8764456Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.8764595Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.8764687Z 2025-05-07T20:33:19.8764841Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8764956Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.8765081Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.8765225Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.8765390Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8765474Z 2025-05-07T20:33:19.8765589Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.8765594Z 2025-05-07T20:33:19.8765713Z moe/activation_test.py:126: 2025-05-07T20:33:19.8765858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8765979Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.8766142Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8766816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.8766941Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.8767345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8767602Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8768019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.8768308Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8768752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:19.8769091Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8769512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.8769710Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.8770094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.8770269Z fn() 2025-05-07T20:33:19.8770726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.8770821Z self.fn.run( 2025-05-07T20:33:19.8771208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8771317Z kernel = self.compile( 2025-05-07T20:33:19.8771742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8771950Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8772097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8772102Z 2025-05-07T20:33:19.8772333Z self = 2025-05-07T20:33:19.8773210Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8773779Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90fa09d0>} 2025-05-07T20:33:19.8774614Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8774835Z context = 2025-05-07T20:33:19.8774842Z 2025-05-07T20:33:19.8775035Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8775330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8775457Z module_map=module_map) 2025-05-07T20:33:19.8775651Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8775768Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.8775857Z E ^ 2025-05-07T20:33:19.8776263Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8776269Z 2025-05-07T20:33:19.8776729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8776736Z 2025-05-07T20:33:19.8776862Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8777157Z self=, 2025-05-07T20:33:19.8777249Z T=16384, 2025-05-07T20:33:19.8777342Z D=5120, 2025-05-07T20:33:19.8777437Z scale_ub=None, 2025-05-07T20:33:19.8777534Z contiguous=True, 2025-05-07T20:33:19.8777636Z compiled=True, 2025-05-07T20:33:19.8777724Z ) 2025-05-07T20:33:19.8777971Z self = 2025-05-07T20:33:19.8778176Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.8778181Z 2025-05-07T20:33:19.8778269Z @given( 2025-05-07T20:33:19.8778416Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8778529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8778660Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8778849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8778980Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8779066Z ) 2025-05-07T20:33:19.8779355Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8779464Z def test_silu_mul_quant( 2025-05-07T20:33:19.8779558Z self, 2025-05-07T20:33:19.8779646Z T: int, 2025-05-07T20:33:19.8779780Z D: int, 2025-05-07T20:33:19.8779938Z scale_ub: Optional[float], 2025-05-07T20:33:19.8780042Z contiguous: bool, 2025-05-07T20:33:19.8780141Z compiled: bool, 2025-05-07T20:33:19.8780237Z ) -> None: 2025-05-07T20:33:19.8780344Z torch.manual_seed(2025) 2025-05-07T20:33:19.8780429Z 2025-05-07T20:33:19.8780626Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8780711Z 2025-05-07T20:33:19.8780818Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8780966Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8781074Z x = x_sign * x_clamp 2025-05-07T20:33:19.8781173Z x0 = x[:, :D] 2025-05-07T20:33:19.8781268Z x1 = x[:, D:] 2025-05-07T20:33:19.8781354Z 2025-05-07T20:33:19.8781458Z if contiguous: 2025-05-07T20:33:19.8781563Z x0 = x0.contiguous() 2025-05-07T20:33:19.8781665Z x1 = x1.contiguous() 2025-05-07T20:33:19.8781759Z 2025-05-07T20:33:19.8781863Z if scale_ub is not None: 2025-05-07T20:33:19.8781988Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8782152Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:33:19.8782240Z ) 2025-05-07T20:33:19.8782328Z else: 2025-05-07T20:33:19.8782444Z scale_ub_tensor = None 2025-05-07T20:33:19.8782529Z 2025-05-07T20:33:19.8782675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8782786Z op = silu_mul_quant 2025-05-07T20:33:19.8782886Z if compiled: 2025-05-07T20:33:19.8783009Z op = torch.compile(op) 2025-05-07T20:33:19.8783131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8783214Z 2025-05-07T20:33:19.8783329Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.8783467Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.8783552Z 2025-05-07T20:33:19.8783713Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8783834Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.8783950Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.8784102Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.8784284Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8784396Z 2025-05-07T20:33:19.8784512Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.8784517Z 2025-05-07T20:33:19.8784629Z moe/activation_test.py:126: 2025-05-07T20:33:19.8784782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8784905Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.8785126Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8785757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.8785877Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.8786289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8786542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8786954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.8787249Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8787740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:19.8788028Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8788457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.8788647Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.8789148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.8789238Z fn() 2025-05-07T20:33:19.8789685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.8789790Z self.fn.run( 2025-05-07T20:33:19.8790170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8790283Z kernel = self.compile( 2025-05-07T20:33:19.8790715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8790917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8791068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:19.8791073Z 2025-05-07T20:33:19.8791304Z self = 2025-05-07T20:33:19.8792177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8792751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90fa2170>} 2025-05-07T20:33:19.8793684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8793916Z context = 2025-05-07T20:33:19.8793921Z 2025-05-07T20:33:19.8794109Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8794416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8794540Z module_map=module_map) 2025-05-07T20:33:19.8794723Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8794846Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.8794934Z E ^ 2025-05-07T20:33:19.8795333Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8795339Z 2025-05-07T20:33:19.8795810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8795815Z 2025-05-07T20:33:19.8796023Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8796281Z self=, 2025-05-07T20:33:19.8796369Z T=1, 2025-05-07T20:33:19.8796459Z D=5120, 2025-05-07T20:33:19.8796564Z scale_ub=1200.0, 2025-05-07T20:33:19.8796661Z contiguous=True, 2025-05-07T20:33:19.8796760Z compiled=True, 2025-05-07T20:33:19.8796851Z ) 2025-05-07T20:33:19.8797096Z self = 2025-05-07T20:33:19.8797290Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.8797295Z 2025-05-07T20:33:19.8797383Z @given( 2025-05-07T20:33:19.8797520Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8797638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8797818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8797951Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8798089Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8798175Z ) 2025-05-07T20:33:19.8798451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8798565Z def test_silu_mul_quant( 2025-05-07T20:33:19.8798700Z self, 2025-05-07T20:33:19.8798834Z T: int, 2025-05-07T20:33:19.8798923Z D: int, 2025-05-07T20:33:19.8799036Z scale_ub: Optional[float], 2025-05-07T20:33:19.8799143Z contiguous: bool, 2025-05-07T20:33:19.8799242Z compiled: bool, 2025-05-07T20:33:19.8799332Z ) -> None: 2025-05-07T20:33:19.8799446Z torch.manual_seed(2025) 2025-05-07T20:33:19.8799529Z 2025-05-07T20:33:19.8799720Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8799814Z 2025-05-07T20:33:19.8799920Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8800062Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8800171Z x = x_sign * x_clamp 2025-05-07T20:33:19.8800266Z x0 = x[:, :D] 2025-05-07T20:33:19.8800358Z x1 = x[:, D:] 2025-05-07T20:33:19.8800448Z 2025-05-07T20:33:19.8800551Z if contiguous: 2025-05-07T20:33:19.8800662Z x0 = x0.contiguous() 2025-05-07T20:33:19.8800770Z x1 = x1.contiguous() 2025-05-07T20:33:19.8800861Z 2025-05-07T20:33:19.8800972Z if scale_ub is not None: 2025-05-07T20:33:19.8801094Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:33:19.8801247Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8801343Z ) 2025-05-07T20:33:19.8801432Z else: 2025-05-07T20:33:19.8801541Z scale_ub_tensor = None 2025-05-07T20:33:19.8801633Z 2025-05-07T20:33:19.8801781Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8801889Z op = silu_mul_quant 2025-05-07T20:33:19.8801993Z if compiled: 2025-05-07T20:33:19.8802109Z op = torch.compile(op) 2025-05-07T20:33:19.8802239Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8802326Z 2025-05-07T20:33:19.8802430Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8802434Z 2025-05-07T20:33:19.8802553Z moe/activation_test.py:117: 2025-05-07T20:33:19.8802702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8802822Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8802946Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8803359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.8803467Z return fn(*args, **kwargs) 2025-05-07T20:33:19.8804029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8804144Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8804605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8804858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8805241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8805361Z kernel = self.compile( 2025-05-07T20:33:19.8805790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8805997Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8806142Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8806147Z 2025-05-07T20:33:19.8806380Z self = 2025-05-07T20:33:19.8807299Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8807867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90f9a050>} 2025-05-07T20:33:19.8808793Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8809012Z context = 2025-05-07T20:33:19.8809018Z 2025-05-07T20:33:19.8809204Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8809508Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8809639Z module_map=module_map) 2025-05-07T20:33:19.8809835Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8809951Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8810044Z E ^ 2025-05-07T20:33:19.8810448Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8810459Z 2025-05-07T20:33:19.8810921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8810926Z 2025-05-07T20:33:19.8811051Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8811301Z self=, 2025-05-07T20:33:19.8811389Z T=1, 2025-05-07T20:33:19.8811486Z D=5120, 2025-05-07T20:33:19.8811582Z scale_ub=None, 2025-05-07T20:33:19.8811684Z contiguous=False, 2025-05-07T20:33:19.8811787Z compiled=True, 2025-05-07T20:33:19.8811873Z ) 2025-05-07T20:33:19.8812119Z self = 2025-05-07T20:33:19.8812312Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.8812317Z 2025-05-07T20:33:19.8812406Z @given( 2025-05-07T20:33:19.8812547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8812667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8812800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8812940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8813071Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8813155Z ) 2025-05-07T20:33:19.8813440Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8813548Z def test_silu_mul_quant( 2025-05-07T20:33:19.8813636Z self, 2025-05-07T20:33:19.8813735Z T: int, 2025-05-07T20:33:19.8813831Z D: int, 2025-05-07T20:33:19.8813948Z scale_ub: Optional[float], 2025-05-07T20:33:19.8814112Z contiguous: bool, 2025-05-07T20:33:19.8814213Z compiled: bool, 2025-05-07T20:33:19.8814307Z ) -> None: 2025-05-07T20:33:19.8814417Z torch.manual_seed(2025) 2025-05-07T20:33:19.8814501Z 2025-05-07T20:33:19.8814705Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8814794Z 2025-05-07T20:33:19.8814899Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8815048Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8815151Z x = x_sign * x_clamp 2025-05-07T20:33:19.8815242Z x0 = x[:, :D] 2025-05-07T20:33:19.8815344Z x1 = x[:, D:] 2025-05-07T20:33:19.8815428Z 2025-05-07T20:33:19.8815524Z if contiguous: 2025-05-07T20:33:19.8815636Z x0 = x0.contiguous() 2025-05-07T20:33:19.8815786Z x1 = x1.contiguous() 2025-05-07T20:33:19.8815878Z 2025-05-07T20:33:19.8815982Z if scale_ub is not None: 2025-05-07T20:33:19.8816103Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8816268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8816356Z ) 2025-05-07T20:33:19.8816445Z else: 2025-05-07T20:33:19.8816559Z scale_ub_tensor = None 2025-05-07T20:33:19.8816687Z 2025-05-07T20:33:19.8816876Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8816988Z op = silu_mul_quant 2025-05-07T20:33:19.8817086Z if compiled: 2025-05-07T20:33:19.8817200Z op = torch.compile(op) 2025-05-07T20:33:19.8817327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8817411Z 2025-05-07T20:33:19.8817523Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.8817662Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.8817751Z 2025-05-07T20:33:19.8817912Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8818028Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.8818145Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.8818291Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.8818452Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8818541Z 2025-05-07T20:33:19.8818664Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:19.8818672Z 2025-05-07T20:33:19.8818784Z moe/activation_test.py:126: 2025-05-07T20:33:19.8818937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8819059Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.8819213Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8819844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.8819963Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.8820370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8820628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8821040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.8821341Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8821788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:19.8822076Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8822502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.8822695Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.8823143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.8823234Z fn() 2025-05-07T20:33:19.8823685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.8823797Z self.fn.run( 2025-05-07T20:33:19.8824536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8824649Z kernel = self.compile( 2025-05-07T20:33:19.8825080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8825283Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8825433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8825606Z 2025-05-07T20:33:19.8825839Z self = 2025-05-07T20:33:19.8839926Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8840829Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd91499900>} 2025-05-07T20:33:19.8841676Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8841910Z context = 2025-05-07T20:33:19.8841916Z 2025-05-07T20:33:19.8842112Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8842421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8842555Z module_map=module_map) 2025-05-07T20:33:19.8842744Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8842863Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.8842963Z E ^ 2025-05-07T20:33:19.8843368Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8843373Z 2025-05-07T20:33:19.8843847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8843852Z 2025-05-07T20:33:19.8843973Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8844225Z self=, 2025-05-07T20:33:19.8844325Z T=1, 2025-05-07T20:33:19.8844414Z D=5120, 2025-05-07T20:33:19.8844510Z scale_ub=None, 2025-05-07T20:33:19.8844616Z contiguous=True, 2025-05-07T20:33:19.8844714Z compiled=False, 2025-05-07T20:33:19.8844810Z ) 2025-05-07T20:33:19.8845057Z self = 2025-05-07T20:33:19.8845243Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.8845251Z 2025-05-07T20:33:19.8845349Z @given( 2025-05-07T20:33:19.8845489Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8845605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8845746Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8845881Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8846011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8846104Z ) 2025-05-07T20:33:19.8846383Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8846500Z def test_silu_mul_quant( 2025-05-07T20:33:19.8846590Z self, 2025-05-07T20:33:19.8846683Z T: int, 2025-05-07T20:33:19.8846778Z D: int, 2025-05-07T20:33:19.8846978Z scale_ub: Optional[float], 2025-05-07T20:33:19.8847084Z contiguous: bool, 2025-05-07T20:33:19.8847190Z compiled: bool, 2025-05-07T20:33:19.8847281Z ) -> None: 2025-05-07T20:33:19.8847392Z torch.manual_seed(2025) 2025-05-07T20:33:19.8847483Z 2025-05-07T20:33:19.8847679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8847766Z 2025-05-07T20:33:19.8847878Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8848021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8848130Z x = x_sign * x_clamp 2025-05-07T20:33:19.8848224Z x0 = x[:, :D] 2025-05-07T20:33:19.8848318Z x1 = x[:, D:] 2025-05-07T20:33:19.8848407Z 2025-05-07T20:33:19.8848554Z if contiguous: 2025-05-07T20:33:19.8848661Z x0 = x0.contiguous() 2025-05-07T20:33:19.8848768Z x1 = x1.contiguous() 2025-05-07T20:33:19.8848853Z 2025-05-07T20:33:19.8848961Z if scale_ub is not None: 2025-05-07T20:33:19.8849085Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8849237Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8849367Z ) 2025-05-07T20:33:19.8849462Z else: 2025-05-07T20:33:19.8849609Z scale_ub_tensor = None 2025-05-07T20:33:19.8849693Z 2025-05-07T20:33:19.8849847Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8849950Z op = silu_mul_quant 2025-05-07T20:33:19.8850056Z if compiled: 2025-05-07T20:33:19.8850169Z 
op = torch.compile(op) 2025-05-07T20:33:19.8850288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8850377Z 2025-05-07T20:33:19.8850480Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8850488Z 2025-05-07T20:33:19.8850598Z moe/activation_test.py:117: 2025-05-07T20:33:19.8850749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8850868Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8850981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8851544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8851661Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8852070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8852323Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8852700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8852813Z kernel = self.compile( 2025-05-07T20:33:19.8853243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8853448Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8853617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8853622Z 2025-05-07T20:33:19.8853877Z self = 2025-05-07T20:33:19.8854754Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8855318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd914988b0>} 2025-05-07T20:33:19.8856149Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8856417Z context = 2025-05-07T20:33:19.8856423Z 2025-05-07T20:33:19.8856610Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8856914Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8857041Z module_map=module_map) 2025-05-07T20:33:19.8857232Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8857344Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8857431Z E ^ 2025-05-07T20:33:19.8857831Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8857835Z 2025-05-07T20:33:19.8858294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8858345Z 2025-05-07T20:33:19.8858472Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8858724Z self=, 2025-05-07T20:33:19.8858813Z T=128, 2025-05-07T20:33:19.8858909Z D=5120, 2025-05-07T20:33:19.8859003Z scale_ub=None, 2025-05-07T20:33:19.8859146Z contiguous=False, 2025-05-07T20:33:19.8859292Z compiled=True, 2025-05-07T20:33:19.8859376Z ) 2025-05-07T20:33:19.8859620Z self = 2025-05-07T20:33:19.8859819Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.8859824Z 2025-05-07T20:33:19.8859912Z @given( 2025-05-07T20:33:19.8860054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8860169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8860301Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8860443Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8860572Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8860659Z ) 2025-05-07T20:33:19.8860939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8861047Z def test_silu_mul_quant( 2025-05-07T20:33:19.8861140Z self, 2025-05-07T20:33:19.8861239Z T: int, 2025-05-07T20:33:19.8861329Z D: int, 2025-05-07T20:33:19.8861441Z scale_ub: Optional[float], 2025-05-07T20:33:19.8861550Z contiguous: bool, 2025-05-07T20:33:19.8861648Z compiled: bool, 2025-05-07T20:33:19.8861745Z ) -> None: 2025-05-07T20:33:19.8861853Z torch.manual_seed(2025) 2025-05-07T20:33:19.8861937Z 2025-05-07T20:33:19.8862133Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8862218Z 2025-05-07T20:33:19.8862327Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8862475Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8862576Z x = x_sign * x_clamp 2025-05-07T20:33:19.8862669Z x0 = x[:, :D] 2025-05-07T20:33:19.8862766Z x1 = x[:, D:] 2025-05-07T20:33:19.8862848Z 2025-05-07T20:33:19.8862944Z if contiguous: 2025-05-07T20:33:19.8863056Z x0 = x0.contiguous() 2025-05-07T20:33:19.8863159Z x1 = x1.contiguous() 2025-05-07T20:33:19.8863247Z 2025-05-07T20:33:19.8863357Z if scale_ub is not None: 2025-05-07T20:33:19.8863475Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8863634Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8863719Z ) 2025-05-07T20:33:19.8863806Z else: 2025-05-07T20:33:19.8863918Z scale_ub_tensor = None 2025-05-07T20:33:19.8863999Z 2025-05-07T20:33:19.8864161Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8864283Z op = silu_mul_quant 2025-05-07T20:33:19.8864401Z if compiled: 2025-05-07T20:33:19.8864514Z op = torch.compile(op) 2025-05-07T20:33:19.8864692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8864776Z 2025-05-07T20:33:19.8864885Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8864890Z 2025-05-07T20:33:19.8864999Z moe/activation_test.py:117: 2025-05-07T20:33:19.8865148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8865268Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8865380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8865791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.8865902Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.8866449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8866612Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8867013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8867263Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8867648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8867870Z kernel = self.compile( 2025-05-07T20:33:19.8868298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8868503Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8868647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8868652Z 2025-05-07T20:33:19.8868886Z self = 2025-05-07T20:33:19.8869755Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8870318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9149b880>} 2025-05-07T20:33:19.8871156Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8871371Z context = 2025-05-07T20:33:19.8871376Z 2025-05-07T20:33:19.8871568Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8871866Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8871997Z module_map=module_map) 2025-05-07T20:33:19.8872178Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8872296Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8872390Z E ^ 2025-05-07T20:33:19.8872787Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8872795Z 2025-05-07T20:33:19.8873255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8873261Z 2025-05-07T20:33:19.8873384Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8873731Z self=, 2025-05-07T20:33:19.8873825Z T=128, 2025-05-07T20:33:19.8873912Z D=7168, 2025-05-07T20:33:19.8874006Z scale_ub=1200.0, 2025-05-07T20:33:19.8874119Z contiguous=False, 2025-05-07T20:33:19.8874237Z compiled=False, 2025-05-07T20:33:19.8874331Z ) 2025-05-07T20:33:19.8874593Z self = 2025-05-07T20:33:19.8874844Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.8874850Z 2025-05-07T20:33:19.8874937Z @given( 2025-05-07T20:33:19.8875076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8875195Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8875337Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8875471Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8875599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8875691Z ) 2025-05-07T20:33:19.8875966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8876073Z def test_silu_mul_quant( 2025-05-07T20:33:19.8876167Z self, 2025-05-07T20:33:19.8876304Z T: int, 2025-05-07T20:33:19.8876391Z D: int, 2025-05-07T20:33:19.8876508Z scale_ub: Optional[float], 2025-05-07T20:33:19.8876610Z contiguous: bool, 2025-05-07T20:33:19.8876711Z compiled: bool, 2025-05-07T20:33:19.8876808Z ) -> None: 2025-05-07T20:33:19.8876917Z torch.manual_seed(2025) 2025-05-07T20:33:19.8877006Z 2025-05-07T20:33:19.8877199Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8877330Z 2025-05-07T20:33:19.8877479Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8877621Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8877724Z x = x_sign * x_clamp 2025-05-07T20:33:19.8877824Z x0 = x[:, :D] 2025-05-07T20:33:19.8877915Z x1 = x[:, D:] 2025-05-07T20:33:19.8877999Z 2025-05-07T20:33:19.8878100Z if contiguous: 2025-05-07T20:33:19.8878204Z x0 = x0.contiguous() 2025-05-07T20:33:19.8878305Z x1 = x1.contiguous() 2025-05-07T20:33:19.8878398Z 2025-05-07T20:33:19.8878501Z if scale_ub is not None: 2025-05-07T20:33:19.8878625Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8878780Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8878866Z ) 2025-05-07T20:33:19.8878959Z else: 2025-05-07T20:33:19.8879067Z scale_ub_tensor = None 2025-05-07T20:33:19.8879155Z 2025-05-07T20:33:19.8879312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8879414Z op = silu_mul_quant 2025-05-07T20:33:19.8879510Z if compiled: 2025-05-07T20:33:19.8879629Z op = torch.compile(op) 2025-05-07T20:33:19.8879748Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8879832Z 2025-05-07T20:33:19.8879940Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8879945Z 2025-05-07T20:33:19.8880056Z moe/activation_test.py:117: 2025-05-07T20:33:19.8880208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8880326Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8880439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8881004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8881113Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8881519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8881776Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8882155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8882266Z kernel = self.compile( 2025-05-07T20:33:19.8882694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8882894Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8883042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8883098Z 2025-05-07T20:33:19.8883330Z self = 2025-05-07T20:33:19.8884200Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8884845Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd91499090>} 2025-05-07T20:33:19.8885677Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8885942Z context = 2025-05-07T20:33:19.8885947Z 2025-05-07T20:33:19.8886136Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8886442Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8886564Z module_map=module_map) 2025-05-07T20:33:19.8886831Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8886952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8887040Z E ^ 2025-05-07T20:33:19.8887440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8887451Z 2025-05-07T20:33:19.8887917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8887925Z 2025-05-07T20:33:19.8888043Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8888301Z self=, 2025-05-07T20:33:19.8888392Z T=128, 2025-05-07T20:33:19.8888480Z D=5120, 2025-05-07T20:33:19.8888580Z scale_ub=None, 2025-05-07T20:33:19.8888680Z contiguous=False, 2025-05-07T20:33:19.8888781Z compiled=False, 2025-05-07T20:33:19.8888865Z ) 2025-05-07T20:33:19.8889114Z self = 2025-05-07T20:33:19.8889312Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.8889317Z 2025-05-07T20:33:19.8889408Z @given( 2025-05-07T20:33:19.8889544Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8889665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8889798Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8889931Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8890070Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8890156Z ) 2025-05-07T20:33:19.8890440Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8890548Z def test_silu_mul_quant( 2025-05-07T20:33:19.8890636Z self, 2025-05-07T20:33:19.8890731Z T: int, 2025-05-07T20:33:19.8890818Z D: int, 2025-05-07T20:33:19.8890935Z scale_ub: Optional[float], 2025-05-07T20:33:19.8891047Z contiguous: bool, 2025-05-07T20:33:19.8891146Z compiled: bool, 2025-05-07T20:33:19.8891237Z ) -> None: 2025-05-07T20:33:19.8891352Z torch.manual_seed(2025) 2025-05-07T20:33:19.8891437Z 2025-05-07T20:33:19.8891631Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8891724Z 2025-05-07T20:33:19.8891834Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8891982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8892090Z x = x_sign * x_clamp 2025-05-07T20:33:19.8892181Z x0 = x[:, :D] 2025-05-07T20:33:19.8892280Z x1 = x[:, D:] 2025-05-07T20:33:19.8892364Z 2025-05-07T20:33:19.8892513Z if contiguous: 2025-05-07T20:33:19.8892626Z x0 = x0.contiguous() 2025-05-07T20:33:19.8892729Z x1 = x1.contiguous() 2025-05-07T20:33:19.8892813Z 2025-05-07T20:33:19.8892924Z if scale_ub is not None: 2025-05-07T20:33:19.8893047Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8893204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8893298Z ) 2025-05-07T20:33:19.8893388Z else: 2025-05-07T20:33:19.8893499Z scale_ub_tensor = None 2025-05-07T20:33:19.8893613Z 2025-05-07T20:33:19.8893782Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8893899Z op = silu_mul_quant 2025-05-07T20:33:19.8893996Z if compiled: 2025-05-07T20:33:19.8894155Z op = torch.compile(op) 2025-05-07T20:33:19.8894282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8894365Z 2025-05-07T20:33:19.8894473Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8894477Z 2025-05-07T20:33:19.8894596Z moe/activation_test.py:117: 2025-05-07T20:33:19.8894740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8894901Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8895060Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8895622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8895742Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8896148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8896400Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8896794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8896903Z kernel = self.compile( 2025-05-07T20:33:19.8897339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8897542Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8897690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8897696Z 2025-05-07T20:33:19.8897933Z self = 2025-05-07T20:33:19.8898803Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8899377Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd913a3eb0>} 2025-05-07T20:33:19.8900213Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8900431Z context = 2025-05-07T20:33:19.8900440Z 2025-05-07T20:33:19.8900637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8900935Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8901064Z module_map=module_map) 2025-05-07T20:33:19.8901247Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8901358Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8901453Z E ^ 2025-05-07T20:33:19.8901854Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8901859Z 2025-05-07T20:33:19.8902374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8902386Z 2025-05-07T20:33:19.8902506Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8902758Z self=, 2025-05-07T20:33:19.8902855Z T=128, 2025-05-07T20:33:19.8902945Z D=5120, 2025-05-07T20:33:19.8903041Z scale_ub=1200.0, 2025-05-07T20:33:19.8903144Z contiguous=True, 2025-05-07T20:33:19.8903239Z compiled=False, 2025-05-07T20:33:19.8903323Z ) 2025-05-07T20:33:19.8903594Z self = 2025-05-07T20:33:19.8903822Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.8903829Z 2025-05-07T20:33:19.8903970Z @given( 2025-05-07T20:33:19.8904103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8904217Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8904357Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8904493Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8904623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8904795Z ) 2025-05-07T20:33:19.8905126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8905235Z def test_silu_mul_quant( 2025-05-07T20:33:19.8905326Z self, 2025-05-07T20:33:19.8905414Z T: int, 2025-05-07T20:33:19.8905501Z D: int, 2025-05-07T20:33:19.8905621Z scale_ub: Optional[float], 2025-05-07T20:33:19.8905723Z contiguous: bool, 2025-05-07T20:33:19.8905826Z compiled: bool, 2025-05-07T20:33:19.8905915Z ) -> None: 2025-05-07T20:33:19.8906022Z torch.manual_seed(2025) 2025-05-07T20:33:19.8906116Z 2025-05-07T20:33:19.8906308Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8906392Z 2025-05-07T20:33:19.8906505Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8906646Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8906747Z x = x_sign * x_clamp 2025-05-07T20:33:19.8906845Z x0 = x[:, :D] 2025-05-07T20:33:19.8906941Z x1 = x[:, D:] 2025-05-07T20:33:19.8907024Z 2025-05-07T20:33:19.8907127Z if contiguous: 2025-05-07T20:33:19.8907231Z x0 = x0.contiguous() 2025-05-07T20:33:19.8907352Z x1 = x1.contiguous() 2025-05-07T20:33:19.8907435Z 2025-05-07T20:33:19.8907537Z if scale_ub is not None: 2025-05-07T20:33:19.8907661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8907815Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8907904Z ) 2025-05-07T20:33:19.8908004Z else: 2025-05-07T20:33:19.8908113Z scale_ub_tensor = None 2025-05-07T20:33:19.8908196Z 2025-05-07T20:33:19.8908350Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8908456Z op = silu_mul_quant 2025-05-07T20:33:19.8908554Z if compiled: 2025-05-07T20:33:19.8908675Z op = torch.compile(op) 2025-05-07T20:33:19.8908797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8908888Z 2025-05-07T20:33:19.8908995Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8909000Z 2025-05-07T20:33:19.8909112Z moe/activation_test.py:117: 2025-05-07T20:33:19.8909265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8909380Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8909494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8910066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8910180Z 
2025-05-07T20:33:19.8916050Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:19.8916170Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:19.8916258Z E   ^
2025-05-07T20:33:19.8916662Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8917137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:19.8917142Z 
2025-05-07T20:33:19.8917260Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:19.8918518Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:19.8918523Z 
2025-05-07T20:33:19.8918617Z     @given(
2025-05-07T20:33:19.8918751Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:19.8918878Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:19.8919009Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:19.8919148Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:19.8919276Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:19.8919363Z     )
2025-05-07T20:33:19.8919645Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:19.8919753Z     def test_silu_mul_quant(
2025-05-07T20:33:19.8919841Z         self,
2025-05-07T20:33:19.8919939Z         T: int,
2025-05-07T20:33:19.8920027Z         D: int,
2025-05-07T20:33:19.8920140Z         scale_ub: Optional[float],
2025-05-07T20:33:19.8920249Z         contiguous: bool,
2025-05-07T20:33:19.8920396Z         compiled: bool,
2025-05-07T20:33:19.8920489Z     ) -> None:
2025-05-07T20:33:19.8920603Z         torch.manual_seed(2025)
2025-05-07T20:33:19.8920687Z 
2025-05-07T20:33:19.8920879Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:19.8920975Z 
2025-05-07T20:33:19.8921080Z         x_sign = torch.sign(x)
2025-05-07T20:33:19.8921230Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:19.8921332Z         x = x_sign * x_clamp
2025-05-07T20:33:19.8921423Z         x0 = x[:, :D]
2025-05-07T20:33:19.8921519Z         x1 = x[:, D:]
2025-05-07T20:33:19.8921602Z 
2025-05-07T20:33:19.8921697Z         if contiguous:
2025-05-07T20:33:19.8921809Z             x0 = x0.contiguous()
2025-05-07T20:33:19.8921911Z             x1 = x1.contiguous()
2025-05-07T20:33:19.8922041Z 
2025-05-07T20:33:19.8922153Z         if scale_ub is not None:
2025-05-07T20:33:19.8922274Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:19.8922429Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:19.8922521Z             )
2025-05-07T20:33:19.8922608Z         else:
2025-05-07T20:33:19.8922720Z             scale_ub_tensor = None
2025-05-07T20:33:19.8922851Z 
2025-05-07T20:33:19.8923041Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:19.8923155Z             op = silu_mul_quant
2025-05-07T20:33:19.8923251Z             if compiled:
2025-05-07T20:33:19.8923365Z                 op = torch.compile(op)
2025-05-07T20:33:19.8923492Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:19.8923576Z 
2025-05-07T20:33:19.8923679Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:19.8923684Z 
2025-05-07T20:33:19.8924093Z moe/activation_test.py:117: 
2025-05-07T20:33:19.8924344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:19.8924517Z moe/activation_test.py:115: in fn
2025-05-07T20:33:19.8924635Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:19.8925050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:19.8925163Z     return fn(*args, **kwargs)
2025-05-07T20:33:19.8925723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:19.8925834Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:19.8926240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:19.8926489Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:19.8926874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:19.8926983Z     kernel = self.compile(
2025-05-07T20:33:19.8927411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:19.8927618Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:19.8927760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:19.8930881Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:19.8931182Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:33:19.8931495Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:19.8931609Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:19.8931700Z E   ^
2025-05-07T20:33:19.8932103Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8932642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
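(Note: every draw in this run fails for the same reason. Triton can lower the fp8e4nv type, i.e. FP8 E4M3, only on GPUs with compute capability 8.9 or newer (Ada/Hopper); older architectures expose just fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal guard is sketched below, assuming a unittest-style suite; supports_fp8e4nv and the class name are hypothetical illustrations, not FBGEMM API:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers FP8 E4M3 ("fp8e4nv") only on compute capability >= 8.9;
        # earlier GPUs offer just fp8e4b15/fp8e5, per the ValueError above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) requires SM 8.9+")
    class ActivationTests(unittest.TestCase):  # hypothetical class name
        ...

With such a guard the suite would skip cleanly on pre-SM89 runners instead of failing every Hypothesis draw.)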
2025-05-07T20:33:19.8932775Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:19.8932775Z (identical test body; fails the same way at fn() with the fp8e4nv CompilationError from _fbgemm_silu_mul_quant)
2025-05-07T20:33:19.8947839Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:19.8949147Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
(test body identical to the listing above; this time fn() itself succeeded and the reference path failed instead)
2025-05-07T20:33:19.8954433Z         y_fp8, y_scale = fn()
2025-05-07T20:33:19.8954570Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:19.8954662Z 
2025-05-07T20:33:19.8954817Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:19.8954934Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:19.8955054Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:19.8955194Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:19.8955351Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:19.8955443Z 
2025-05-07T20:33:19.8955561Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:19.8955566Z 
2025-05-07T20:33:19.8955683Z moe/activation_test.py:126: 
2025-05-07T20:33:19.8955828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:19.8955949Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:19.8956111Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:19.8956738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:19.8956857Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:19.8957324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:19.8957577Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:19.8957995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:19.8958287Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:19.8958736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:33:19.8959028Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:19.8959450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:19.8959691Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:19.8960081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:19.8960168Z     fn()
2025-05-07T20:33:19.8960629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:19.8960770Z     self.fn.run(
2025-05-07T20:33:19.8961188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:19.8961307Z     kernel = self.compile(
2025-05-07T20:33:19.8961734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:19.8961940Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:19.8962085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:19.8965036Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:19.8965335Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:33:19.8965649Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:19.8965767Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:19.8965860Z E   ^
2025-05-07T20:33:19.8966269Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8966733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
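(Note the variant just above: with scale_ub=None the silu_mul_quant call got through, and the failure moved into the test's reference path, because triton_quantize_fp8_row is itself a Triton kernel that casts to fp8e4nv. On a device without fp8e4nv support, one option is to keep the reference in eager PyTorch. A sketch, assuming torch.float8_e4m3fn is available in the installed PyTorch and that the scale convention matches the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None]; rowwise_quantize_fp8_ref is a hypothetical helper, and its clamping/eps details are guesses rather than the FBGEMM kernel's exact semantics:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def rowwise_quantize_fp8_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max in fp32, optionally clamped to the upper bound.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Scale chosen so that y ~= y_fp8.to(torch.float32) * scale[:, None].
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Such a fallback keeps the comparison meaningful even where Triton cannot compile either kernel.)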
2025-05-07T20:33:19.8972982Z op = torch.compile(op) 2025-05-07T20:33:19.8973101Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8973184Z 2025-05-07T20:33:19.8973297Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8973301Z 2025-05-07T20:33:19.8973412Z moe/activation_test.py:117: 2025-05-07T20:33:19.8973560Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8973681Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8973795Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8974206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.8974324Z return fn(*args, **kwargs) 2025-05-07T20:33:19.8974876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8974994Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8975396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8975649Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8976043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8976233Z kernel = self.compile( 2025-05-07T20:33:19.8976670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8976871Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8982067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8982079Z 2025-05-07T20:33:19.8982337Z self = 2025-05-07T20:33:19.8983208Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8983914Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90b7feb0>} 2025-05-07T20:33:19.8984753Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8985022Z context = 2025-05-07T20:33:19.8985068Z 2025-05-07T20:33:19.8985264Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8985561Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8985692Z module_map=module_map) 2025-05-07T20:33:19.8985875Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8985990Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8986089Z E ^ 2025-05-07T20:33:19.8986490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8986494Z 2025-05-07T20:33:19.8986963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8986975Z 2025-05-07T20:33:19.8987094Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8987352Z self=, 2025-05-07T20:33:19.8987448Z T=1, 2025-05-07T20:33:19.8987537Z D=5120, 2025-05-07T20:33:19.8987632Z scale_ub=1200.0, 2025-05-07T20:33:19.8987737Z contiguous=False, 2025-05-07T20:33:19.8987833Z compiled=False, 2025-05-07T20:33:19.8987919Z ) 2025-05-07T20:33:19.8988173Z self = 2025-05-07T20:33:19.8988364Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.8988372Z 2025-05-07T20:33:19.8988467Z @given( 2025-05-07T20:33:19.8988602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8988719Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8988855Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8988989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8989120Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8989215Z ) 2025-05-07T20:33:19.8989496Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8989603Z def test_silu_mul_quant( 2025-05-07T20:33:19.8989698Z self, 2025-05-07T20:33:19.8989786Z T: int, 2025-05-07T20:33:19.8989876Z D: int, 2025-05-07T20:33:19.8989993Z scale_ub: Optional[float], 2025-05-07T20:33:19.8990097Z contiguous: bool, 2025-05-07T20:33:19.8990200Z compiled: bool, 2025-05-07T20:33:19.8990290Z ) -> None: 2025-05-07T20:33:19.8990402Z torch.manual_seed(2025) 2025-05-07T20:33:19.8990495Z 2025-05-07T20:33:19.8990688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8990846Z 2025-05-07T20:33:19.8990959Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8991102Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8991206Z x = x_sign * x_clamp 2025-05-07T20:33:19.8991306Z x0 = x[:, :D] 2025-05-07T20:33:19.8991399Z x1 = x[:, D:] 2025-05-07T20:33:19.8991483Z 2025-05-07T20:33:19.8991585Z if contiguous: 2025-05-07T20:33:19.8991689Z x0 = x0.contiguous() 2025-05-07T20:33:19.8991799Z x1 = x1.contiguous() 2025-05-07T20:33:19.8991882Z 2025-05-07T20:33:19.8991984Z if scale_ub is not None: 2025-05-07T20:33:19.8992112Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8992266Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8992403Z ) 2025-05-07T20:33:19.8992497Z else: 2025-05-07T20:33:19.8992605Z scale_ub_tensor = None 2025-05-07T20:33:19.8992688Z 2025-05-07T20:33:19.8992848Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8992952Z op = silu_mul_quant 2025-05-07T20:33:19.8993048Z if compiled: 2025-05-07T20:33:19.8993169Z op = torch.compile(op) 2025-05-07T20:33:19.8993336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8993467Z 2025-05-07T20:33:19.8993713Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8993718Z 2025-05-07T20:33:19.8993830Z moe/activation_test.py:117: 2025-05-07T20:33:19.8993984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8994100Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8994214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8994831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8994946Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8995354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8995613Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8995995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8996117Z kernel = self.compile( 2025-05-07T20:33:19.8996550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8996748Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8996901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8996906Z 2025-05-07T20:33:19.8997138Z self = 2025-05-07T20:33:19.8998017Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8998585Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd913a3e20>} 2025-05-07T20:33:19.8999430Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8999648Z context = 2025-05-07T20:33:19.8999653Z 2025-05-07T20:33:19.8999839Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9000143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9000269Z module_map=module_map) 2025-05-07T20:33:19.9000506Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9000630Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9000719Z E ^ 2025-05-07T20:33:19.9001125Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9001135Z 2025-05-07T20:33:19.9001596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9001601Z 2025-05-07T20:33:19.9001720Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9001979Z self=, 2025-05-07T20:33:19.9002069Z T=16384, 2025-05-07T20:33:19.9002163Z D=5120, 2025-05-07T20:33:19.9002260Z scale_ub=1200.0, 2025-05-07T20:33:19.9002408Z contiguous=False, 2025-05-07T20:33:19.9002511Z compiled=True, 2025-05-07T20:33:19.9002598Z ) 2025-05-07T20:33:19.9002846Z self = 2025-05-07T20:33:19.9003055Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.9003060Z 2025-05-07T20:33:19.9003152Z @given( 2025-05-07T20:33:19.9003333Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9003496Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9003631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9003772Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9003902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9003988Z ) 2025-05-07T20:33:19.9004274Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9004382Z def test_silu_mul_quant( 2025-05-07T20:33:19.9004473Z self, 2025-05-07T20:33:19.9004568Z T: int, 2025-05-07T20:33:19.9004657Z D: int, 2025-05-07T20:33:19.9004771Z scale_ub: Optional[float], 2025-05-07T20:33:19.9004883Z contiguous: bool, 2025-05-07T20:33:19.9004982Z compiled: bool, 2025-05-07T20:33:19.9005072Z ) -> None: 2025-05-07T20:33:19.9005187Z torch.manual_seed(2025) 2025-05-07T20:33:19.9005274Z 2025-05-07T20:33:19.9005472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9005563Z 2025-05-07T20:33:19.9005668Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9005821Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9005924Z x = x_sign * x_clamp 2025-05-07T20:33:19.9006015Z x0 = x[:, :D] 2025-05-07T20:33:19.9006117Z x1 = x[:, D:] 2025-05-07T20:33:19.9006200Z 2025-05-07T20:33:19.9006295Z if contiguous: 2025-05-07T20:33:19.9006405Z x0 = x0.contiguous() 2025-05-07T20:33:19.9006510Z x1 = x1.contiguous() 2025-05-07T20:33:19.9006593Z 2025-05-07T20:33:19.9006702Z if scale_ub is not None: 2025-05-07T20:33:19.9006827Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9006980Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9007072Z ) 2025-05-07T20:33:19.9007158Z else: 2025-05-07T20:33:19.9007275Z scale_ub_tensor = None 2025-05-07T20:33:19.9007358Z 2025-05-07T20:33:19.9007507Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9007616Z op = silu_mul_quant 2025-05-07T20:33:19.9007712Z if compiled: 2025-05-07T20:33:19.9007825Z op = torch.compile(op) 2025-05-07T20:33:19.9007952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9008035Z 2025-05-07T20:33:19.9008138Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9008142Z 2025-05-07T20:33:19.9008260Z moe/activation_test.py:117: 2025-05-07T20:33:19.9008408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9008530Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9008698Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9009111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9009224Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9009784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9009895Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9010304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9010557Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9010945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9011096Z kernel = self.compile( 2025-05-07T20:33:19.9011528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9011734Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9011877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9011925Z 2025-05-07T20:33:19.9012229Z self = 2025-05-07T20:33:19.9013124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9013835Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd901288b0>} 2025-05-07T20:33:19.9014884Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9015154Z context = 2025-05-07T20:33:19.9015160Z 2025-05-07T20:33:19.9015401Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9015775Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9015919Z module_map=module_map) 2025-05-07T20:33:19.9016108Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9016224Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9016313Z E ^ 2025-05-07T20:33:19.9016718Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9016725Z 2025-05-07T20:33:19.9017189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9017194Z 2025-05-07T20:33:19.9017319Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9017568Z self=, 2025-05-07T20:33:19.9017660Z T=2048, 2025-05-07T20:33:19.9017754Z D=7168, 2025-05-07T20:33:19.9017853Z scale_ub=1200.0, 2025-05-07T20:33:19.9017953Z contiguous=False, 2025-05-07T20:33:19.9018055Z compiled=True, 2025-05-07T20:33:19.9018139Z ) 2025-05-07T20:33:19.9018389Z self = 2025-05-07T20:33:19.9018588Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.9018592Z 2025-05-07T20:33:19.9018681Z @given( 2025-05-07T20:33:19.9018821Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9018938Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9019069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9019259Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9019392Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9019483Z ) 2025-05-07T20:33:19.9019763Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9019875Z def test_silu_mul_quant( 2025-05-07T20:33:19.9019968Z self, 2025-05-07T20:33:19.9020056Z T: int, 2025-05-07T20:33:19.9020144Z D: int, 2025-05-07T20:33:19.9020264Z scale_ub: Optional[float], 2025-05-07T20:33:19.9020367Z contiguous: bool, 2025-05-07T20:33:19.9020466Z compiled: bool, 2025-05-07T20:33:19.9020565Z ) -> None: 2025-05-07T20:33:19.9020674Z torch.manual_seed(2025) 2025-05-07T20:33:19.9020758Z 2025-05-07T20:33:19.9020957Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9021092Z 2025-05-07T20:33:19.9021198Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9021350Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9021451Z x = x_sign * x_clamp 2025-05-07T20:33:19.9021553Z x0 = x[:, :D] 2025-05-07T20:33:19.9021645Z x1 = x[:, D:] 2025-05-07T20:33:19.9021775Z 2025-05-07T20:33:19.9021879Z if contiguous: 2025-05-07T20:33:19.9022025Z x0 = x0.contiguous() 2025-05-07T20:33:19.9022129Z x1 = x1.contiguous() 2025-05-07T20:33:19.9022222Z 2025-05-07T20:33:19.9022327Z if scale_ub is not None: 2025-05-07T20:33:19.9022447Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9022607Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9022695Z ) 2025-05-07T20:33:19.9022782Z else: 2025-05-07T20:33:19.9022898Z scale_ub_tensor = None 2025-05-07T20:33:19.9022984Z 2025-05-07T20:33:19.9023143Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9023248Z op = silu_mul_quant 2025-05-07T20:33:19.9023348Z if compiled: 2025-05-07T20:33:19.9023476Z op = torch.compile(op) 2025-05-07T20:33:19.9023626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9023727Z 2025-05-07T20:33:19.9024247Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9024258Z 2025-05-07T20:33:19.9024466Z moe/activation_test.py:117: 2025-05-07T20:33:19.9024723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9024934Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9025118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9025551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9025658Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9026217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9026334Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9026737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9026989Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9027380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9027487Z kernel = self.compile( 2025-05-07T20:33:19.9027921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9028119Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9028263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9028271Z 2025-05-07T20:33:19.9028508Z self = 2025-05-07T20:33:19.9029566Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9030145Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90129090>} 2025-05-07T20:33:19.9030977Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9031193Z context = 2025-05-07T20:33:19.9031206Z 2025-05-07T20:33:19.9031394Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9031766Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9031903Z module_map=module_map) 2025-05-07T20:33:19.9032086Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9032198Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9032293Z E ^ 2025-05-07T20:33:19.9032820Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9032826Z 2025-05-07T20:33:19.9033293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9033298Z 2025-05-07T20:33:19.9033414Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9033747Z self=, 2025-05-07T20:33:19.9033845Z T=1, 2025-05-07T20:33:19.9033938Z D=5120, 2025-05-07T20:33:19.9034032Z scale_ub=None, 2025-05-07T20:33:19.9034138Z contiguous=False, 2025-05-07T20:33:19.9034234Z compiled=False, 2025-05-07T20:33:19.9034318Z ) 2025-05-07T20:33:19.9034570Z self = 2025-05-07T20:33:19.9034760Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.9034769Z 2025-05-07T20:33:19.9034862Z @given( 2025-05-07T20:33:19.9034998Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9035113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9035249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9035382Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9035512Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9035602Z ) 2025-05-07T20:33:19.9035879Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9035996Z def test_silu_mul_quant( 2025-05-07T20:33:19.9036083Z self, 2025-05-07T20:33:19.9036171Z T: int, 2025-05-07T20:33:19.9036263Z D: int, 2025-05-07T20:33:19.9036377Z scale_ub: Optional[float], 2025-05-07T20:33:19.9036480Z contiguous: bool, 2025-05-07T20:33:19.9036585Z compiled: bool, 2025-05-07T20:33:19.9036675Z ) -> None: 2025-05-07T20:33:19.9036785Z torch.manual_seed(2025) 2025-05-07T20:33:19.9036877Z 2025-05-07T20:33:19.9037071Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9037155Z 2025-05-07T20:33:19.9037267Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9037408Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9037511Z x = x_sign * x_clamp 2025-05-07T20:33:19.9037611Z x0 = x[:, :D] 2025-05-07T20:33:19.9037703Z x1 = x[:, D:] 2025-05-07T20:33:19.9037798Z 2025-05-07T20:33:19.9037894Z if contiguous: 2025-05-07T20:33:19.9038001Z x0 = x0.contiguous() 2025-05-07T20:33:19.9038113Z x1 = x1.contiguous() 2025-05-07T20:33:19.9038197Z 2025-05-07T20:33:19.9038356Z if scale_ub is not None: 2025-05-07T20:33:19.9038484Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9038636Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9038723Z ) 2025-05-07T20:33:19.9038819Z else: 2025-05-07T20:33:19.9038928Z scale_ub_tensor = None 2025-05-07T20:33:19.9039012Z 2025-05-07T20:33:19.9039165Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9039272Z op = silu_mul_quant 2025-05-07T20:33:19.9039374Z if compiled: 2025-05-07T20:33:19.9039492Z op = torch.compile(op) 2025-05-07T20:33:19.9039612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9039702Z 2025-05-07T20:33:19.9039807Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9039857Z 2025-05-07T20:33:19.9039967Z moe/activation_test.py:117: 2025-05-07T20:33:19.9040118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9040238Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9040351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9040917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9041115Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9041526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9041776Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9042158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9042270Z kernel = self.compile( 2025-05-07T20:33:19.9042704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9042912Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9043059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9043064Z 2025-05-07T20:33:19.9043304Z self = 2025-05-07T20:33:19.9044400Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9045116Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd901297e0>} 2025-05-07T20:33:19.9046121Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9046344Z context = 2025-05-07T20:33:19.9046349Z 2025-05-07T20:33:19.9046537Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9046841Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9046970Z module_map=module_map) 2025-05-07T20:33:19.9047162Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9047277Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9047366Z E ^ 2025-05-07T20:33:19.9047773Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9047777Z 2025-05-07T20:33:19.9048246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9048253Z 2025-05-07T20:33:19.9048382Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9048684Z self=, 2025-05-07T20:33:19.9048774Z T=4096, 2025-05-07T20:33:19.9048867Z D=7168, 2025-05-07T20:33:19.9048965Z scale_ub=1200.0, 2025-05-07T20:33:19.9049069Z contiguous=False, 2025-05-07T20:33:19.9049173Z compiled=False, 2025-05-07T20:33:19.9049261Z ) 2025-05-07T20:33:19.9049508Z self = 2025-05-07T20:33:19.9049717Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.9049722Z 2025-05-07T20:33:19.9049811Z @given( 2025-05-07T20:33:19.9049950Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9050064Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9050241Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9050380Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9050510Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9050599Z ) 2025-05-07T20:33:19.9050884Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9050992Z def test_silu_mul_quant( 2025-05-07T20:33:19.9051161Z self, 2025-05-07T20:33:19.9051257Z T: int, 2025-05-07T20:33:19.9051411Z D: int, 2025-05-07T20:33:19.9051525Z scale_ub: Optional[float], 2025-05-07T20:33:19.9051632Z contiguous: bool, 2025-05-07T20:33:19.9051731Z compiled: bool, 2025-05-07T20:33:19.9051831Z ) -> None: 2025-05-07T20:33:19.9051939Z torch.manual_seed(2025) 2025-05-07T20:33:19.9052023Z 2025-05-07T20:33:19.9052226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9052311Z 2025-05-07T20:33:19.9052421Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9052569Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9052670Z x = x_sign * x_clamp 2025-05-07T20:33:19.9052763Z x0 = x[:, :D] 2025-05-07T20:33:19.9052872Z x1 = x[:, D:] 2025-05-07T20:33:19.9052956Z 2025-05-07T20:33:19.9053053Z if contiguous: 2025-05-07T20:33:19.9053165Z x0 = x0.contiguous() 2025-05-07T20:33:19.9053272Z x1 = x1.contiguous() 2025-05-07T20:33:19.9053365Z 2025-05-07T20:33:19.9053472Z if scale_ub is not None: 2025-05-07T20:33:19.9053593Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9053766Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9053867Z ) 2025-05-07T20:33:19.9053968Z else: 2025-05-07T20:33:19.9054094Z scale_ub_tensor = None 2025-05-07T20:33:19.9054178Z 2025-05-07T20:33:19.9054326Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9054441Z op = silu_mul_quant 2025-05-07T20:33:19.9054538Z if compiled: 2025-05-07T20:33:19.9054653Z op = torch.compile(op) 2025-05-07T20:33:19.9054781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9054866Z 2025-05-07T20:33:19.9054970Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9054981Z 2025-05-07T20:33:19.9055094Z moe/activation_test.py:117: 2025-05-07T20:33:19.9055246Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9055377Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9055493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9056057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:19.9056178Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9056585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9056851Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9057292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9057403Z kernel = self.compile( 2025-05-07T20:33:19.9057847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9058052Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9058196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9058201Z 2025-05-07T20:33:19.9058442Z self = 2025-05-07T20:33:19.9059317Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9059943Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9012a200>} 2025-05-07T20:33:19.9060782Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9061095Z context = 2025-05-07T20:33:19.9061100Z 2025-05-07T20:33:19.9061289Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9061588Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9061716Z module_map=module_map) 2025-05-07T20:33:19.9061901Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9062016Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9062111Z E ^ 2025-05-07T20:33:19.9062514Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9062519Z 2025-05-07T20:33:19.9062991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9062999Z 2025-05-07T20:33:19.9063123Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9063377Z self=, 2025-05-07T20:33:19.9063474Z T=16384, 2025-05-07T20:33:19.9063564Z D=7168, 2025-05-07T20:33:19.9063661Z scale_ub=None, 2025-05-07T20:33:19.9063766Z contiguous=True, 2025-05-07T20:33:19.9063863Z compiled=True, 2025-05-07T20:33:19.9063953Z ) 2025-05-07T20:33:19.9064199Z self = 2025-05-07T20:33:19.9064398Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.9064403Z 2025-05-07T20:33:19.9064498Z @given( 2025-05-07T20:33:19.9064635Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9064750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9064888Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9065023Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9065164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9065250Z ) 2025-05-07T20:33:19.9065528Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9065642Z def test_silu_mul_quant( 2025-05-07T20:33:19.9065732Z self, 2025-05-07T20:33:19.9065820Z T: int, 2025-05-07T20:33:19.9065915Z D: int, 2025-05-07T20:33:19.9066029Z scale_ub: Optional[float], 2025-05-07T20:33:19.9066132Z contiguous: bool, 2025-05-07T20:33:19.9066238Z compiled: bool, 2025-05-07T20:33:19.9066327Z ) -> None: 2025-05-07T20:33:19.9066435Z torch.manual_seed(2025) 2025-05-07T20:33:19.9066526Z 2025-05-07T20:33:19.9066770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9066855Z 2025-05-07T20:33:19.9066965Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9067107Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9067219Z x = x_sign * x_clamp 2025-05-07T20:33:19.9067314Z x0 = x[:, :D] 2025-05-07T20:33:19.9067405Z x1 = x[:, D:] 2025-05-07T20:33:19.9067495Z 2025-05-07T20:33:19.9067591Z if contiguous: 2025-05-07T20:33:19.9067696Z x0 = x0.contiguous() 2025-05-07T20:33:19.9067805Z x1 = x1.contiguous() 2025-05-07T20:33:19.9067888Z 2025-05-07T20:33:19.9067992Z if scale_ub is not None: 2025-05-07T20:33:19.9068120Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9068320Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9068409Z ) 2025-05-07T20:33:19.9068504Z else: 2025-05-07T20:33:19.9068611Z scale_ub_tensor = None 2025-05-07T20:33:19.9068704Z 2025-05-07T20:33:19.9068854Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9068957Z op = silu_mul_quant 2025-05-07T20:33:19.9069062Z if compiled: 2025-05-07T20:33:19.9069223Z op = torch.compile(op) 2025-05-07T20:33:19.9069387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9069479Z 2025-05-07T20:33:19.9069586Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9069591Z 2025-05-07T20:33:19.9069703Z moe/activation_test.py:117: 2025-05-07T20:33:19.9069859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9069977Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9070099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9070518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9070625Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7efd9012b760>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
[... identical Triton traceback and CompilationError as above ...]
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
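Every failure in this run bottoms out in the same ValueError: the _fbgemm_silu_mul_quant kernel asks Triton for the fp8e4nv (FP8 E4M3) element type, and the GPU this job landed on only exposes fp8e4b15 and fp8e5. A minimal guard one might add to such a test is sketched below; the helper name and the compute-capability threshold of (8, 9) are assumptions for illustration, not FBGEMM's actual code.

```python
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv (torch.float8_e4m3fn) only on GPUs with
    # compute capability >= (8, 9); older parts expose just fp8e4b15/fp8e5,
    # which matches the CompilationError reported in this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Illustrative usage on a test like the one above:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
# def test_silu_mul_quant(...): ...
```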
[Hypothesis went on to try eleven more examples; each failed with the identical CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9264011Z 2025-05-07T20:33:19.9264493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9264499Z 2025-05-07T20:33:19.9264640Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9264921Z self=, 2025-05-07T20:33:19.9265020Z T=2048, 2025-05-07T20:33:19.9265110Z D=5120, 2025-05-07T20:33:19.9265206Z scale_ub=None, 2025-05-07T20:33:19.9265313Z contiguous=False, 2025-05-07T20:33:19.9265410Z compiled=True, 2025-05-07T20:33:19.9265497Z ) 2025-05-07T20:33:19.9265752Z self = 2025-05-07T20:33:19.9265949Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.9265956Z 2025-05-07T20:33:19.9266053Z @given( 2025-05-07T20:33:19.9266188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9266306Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9266445Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9266580Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9266715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9271695Z ) 2025-05-07T20:33:19.9272003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9272114Z def test_silu_mul_quant( 2025-05-07T20:33:19.9272211Z self, 2025-05-07T20:33:19.9272300Z T: int, 2025-05-07T20:33:19.9272399Z D: int, 2025-05-07T20:33:19.9272512Z scale_ub: Optional[float], 2025-05-07T20:33:19.9272616Z contiguous: bool, 2025-05-07T20:33:19.9272722Z compiled: bool, 2025-05-07T20:33:19.9272817Z ) -> None: 2025-05-07T20:33:19.9272930Z torch.manual_seed(2025) 2025-05-07T20:33:19.9273024Z 2025-05-07T20:33:19.9273335Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9273423Z 2025-05-07T20:33:19.9273731Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9273877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9273983Z x = x_sign * x_clamp 2025-05-07T20:33:19.9274084Z x0 = x[:, :D] 2025-05-07T20:33:19.9274179Z x1 = x[:, D:] 2025-05-07T20:33:19.9274271Z 2025-05-07T20:33:19.9274369Z if contiguous: 2025-05-07T20:33:19.9274474Z x0 = x0.contiguous() 2025-05-07T20:33:19.9274585Z x1 = x1.contiguous() 2025-05-07T20:33:19.9274668Z 2025-05-07T20:33:19.9274775Z if scale_ub is not None: 2025-05-07T20:33:19.9274905Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9275061Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9275208Z ) 2025-05-07T20:33:19.9275305Z else: 2025-05-07T20:33:19.9275415Z scale_ub_tensor = None 2025-05-07T20:33:19.9275501Z 2025-05-07T20:33:19.9275660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9275768Z op = silu_mul_quant 2025-05-07T20:33:19.9275866Z if compiled: 2025-05-07T20:33:19.9276039Z op = torch.compile(op) 2025-05-07T20:33:19.9276203Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9276293Z 2025-05-07T20:33:19.9276398Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9276404Z 2025-05-07T20:33:19.9276517Z moe/activation_test.py:117: 2025-05-07T20:33:19.9276674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9276794Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9276914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9277346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9277461Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9278038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9278152Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9278563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9278835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9279225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9279334Z kernel = self.compile( 2025-05-07T20:33:19.9279777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9279979Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9280133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9280138Z 2025-05-07T20:33:19.9280377Z self = 2025-05-07T20:33:19.9281263Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9281854Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbce43a0>} 2025-05-07T20:33:19.9282698Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9282929Z context = 2025-05-07T20:33:19.9282934Z 2025-05-07T20:33:19.9283173Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9283482Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9283607Z module_map=module_map) 2025-05-07T20:33:19.9283796Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9283917Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9284008Z E ^ 2025-05-07T20:33:19.9284412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9284417Z 2025-05-07T20:33:19.9284894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9284899Z 2025-05-07T20:33:19.9285020Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9285334Z self=, 2025-05-07T20:33:19.9285424Z T=2048, 2025-05-07T20:33:19.9285511Z D=5120, 2025-05-07T20:33:19.9285615Z scale_ub=1200.0, 2025-05-07T20:33:19.9285715Z contiguous=False, 2025-05-07T20:33:19.9285812Z compiled=True, 2025-05-07T20:33:19.9285903Z ) 2025-05-07T20:33:19.9286151Z self = 2025-05-07T20:33:19.9286443Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.9286456Z 2025-05-07T20:33:19.9286547Z @given( 2025-05-07T20:33:19.9286683Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9286805Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9286939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9287074Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9287213Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9287303Z ) 2025-05-07T20:33:19.9287584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9287702Z def test_silu_mul_quant( 2025-05-07T20:33:19.9287793Z self, 2025-05-07T20:33:19.9287882Z T: int, 2025-05-07T20:33:19.9287978Z D: int, 2025-05-07T20:33:19.9288093Z scale_ub: Optional[float], 2025-05-07T20:33:19.9288207Z contiguous: bool, 2025-05-07T20:33:19.9288312Z compiled: bool, 2025-05-07T20:33:19.9288407Z ) -> None: 2025-05-07T20:33:19.9288525Z torch.manual_seed(2025) 2025-05-07T20:33:19.9288610Z 2025-05-07T20:33:19.9288804Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9288897Z 2025-05-07T20:33:19.9289004Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9289150Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9289264Z x = x_sign * x_clamp 2025-05-07T20:33:19.9289361Z x0 = x[:, :D] 2025-05-07T20:33:19.9289453Z x1 = x[:, D:] 2025-05-07T20:33:19.9289545Z 2025-05-07T20:33:19.9289641Z if contiguous: 2025-05-07T20:33:19.9289750Z x0 = x0.contiguous() 2025-05-07T20:33:19.9289862Z x1 = x1.contiguous() 2025-05-07T20:33:19.9289946Z 2025-05-07T20:33:19.9290058Z if scale_ub is not None: 2025-05-07T20:33:19.9290184Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9290343Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9290441Z ) 2025-05-07T20:33:19.9290534Z else: 2025-05-07T20:33:19.9290645Z scale_ub_tensor = None 2025-05-07T20:33:19.9290737Z 2025-05-07T20:33:19.9290887Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9290995Z op = silu_mul_quant 2025-05-07T20:33:19.9291101Z if compiled: 2025-05-07T20:33:19.9291217Z op = torch.compile(op) 2025-05-07T20:33:19.9291343Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9291434Z 2025-05-07T20:33:19.9291540Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9291544Z 2025-05-07T20:33:19.9291718Z moe/activation_test.py:117: 2025-05-07T20:33:19.9291869Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9291988Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9292113Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9292532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9292641Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9293212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9293326Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9293741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9294045Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9294434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9294548Z kernel = self.compile( 2025-05-07T20:33:19.9294984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9295278Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9295424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9295430Z 2025-05-07T20:33:19.9295669Z self = 2025-05-07T20:33:19.9296559Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9297145Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbce4820>} 2025-05-07T20:33:19.9297997Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9298224Z context = 2025-05-07T20:33:19.9298229Z 2025-05-07T20:33:19.9298420Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9298730Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9298856Z module_map=module_map) 2025-05-07T20:33:19.9299048Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9299165Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9299256Z E ^ 2025-05-07T20:33:19.9299670Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9299675Z 2025-05-07T20:33:19.9300146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9300154Z 2025-05-07T20:33:19.9300284Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9300538Z self=, 2025-05-07T20:33:19.9300628Z T=4096, 2025-05-07T20:33:19.9300722Z D=5120, 2025-05-07T20:33:19.9300818Z scale_ub=1200.0, 2025-05-07T20:33:19.9300916Z contiguous=True, 2025-05-07T20:33:19.9301023Z compiled=True, 2025-05-07T20:33:19.9301108Z ) 2025-05-07T20:33:19.9301359Z self = 2025-05-07T20:33:19.9301567Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.9301572Z 2025-05-07T20:33:19.9301662Z @given( 2025-05-07T20:33:19.9301848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9301974Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9302107Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9302250Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9302385Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9302472Z ) 2025-05-07T20:33:19.9302760Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9302868Z def test_silu_mul_quant( 2025-05-07T20:33:19.9302957Z self, 2025-05-07T20:33:19.9303052Z T: int, 2025-05-07T20:33:19.9303140Z D: int, 2025-05-07T20:33:19.9303254Z scale_ub: Optional[float], 2025-05-07T20:33:19.9303364Z contiguous: bool, 2025-05-07T20:33:19.9303511Z compiled: bool, 2025-05-07T20:33:19.9303610Z ) -> None: 2025-05-07T20:33:19.9303720Z torch.manual_seed(2025) 2025-05-07T20:33:19.9303804Z 2025-05-07T20:33:19.9304009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9304095Z 2025-05-07T20:33:19.9304203Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9304402Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9304575Z x = x_sign * x_clamp 2025-05-07T20:33:19.9304670Z x0 = x[:, :D] 2025-05-07T20:33:19.9304770Z x1 = x[:, D:] 2025-05-07T20:33:19.9304854Z 2025-05-07T20:33:19.9304952Z if contiguous: 2025-05-07T20:33:19.9305071Z x0 = x0.contiguous() 2025-05-07T20:33:19.9305175Z x1 = x1.contiguous() 2025-05-07T20:33:19.9305262Z 2025-05-07T20:33:19.9305374Z if scale_ub is not None: 2025-05-07T20:33:19.9305496Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9305661Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9305750Z ) 2025-05-07T20:33:19.9305839Z else: 2025-05-07T20:33:19.9305958Z scale_ub_tensor = None 2025-05-07T20:33:19.9306044Z 2025-05-07T20:33:19.9306197Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9306309Z op = silu_mul_quant 2025-05-07T20:33:19.9306412Z if compiled: 2025-05-07T20:33:19.9306531Z op = torch.compile(op) 2025-05-07T20:33:19.9306661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9306747Z 2025-05-07T20:33:19.9306853Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9306864Z 2025-05-07T20:33:19.9306977Z moe/activation_test.py:117: 2025-05-07T20:33:19.9307125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9307249Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9307366Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9307792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9307909Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9308474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9308589Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9309011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9309268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9309663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9309772Z kernel = self.compile( 2025-05-07T20:33:19.9310212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9310428Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9310629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9310635Z 2025-05-07T20:33:19.9310881Z self = 2025-05-07T20:33:19.9311763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9312347Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbce5360>} 2025-05-07T20:33:19.9313199Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9313468Z context = 2025-05-07T20:33:19.9313473Z 2025-05-07T20:33:19.9313773Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9314076Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9314247Z module_map=module_map) 2025-05-07T20:33:19.9314487Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9314607Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9314706Z E ^ 2025-05-07T20:33:19.9315112Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9315117Z 2025-05-07T20:33:19.9315588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9315596Z 2025-05-07T20:33:19.9315723Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9315977Z self=, 2025-05-07T20:33:19.9316078Z T=128, 2025-05-07T20:33:19.9316168Z D=5120, 2025-05-07T20:33:19.9316264Z scale_ub=1200.0, 2025-05-07T20:33:19.9316373Z contiguous=False, 2025-05-07T20:33:19.9316470Z compiled=True, 2025-05-07T20:33:19.9316559Z ) 2025-05-07T20:33:19.9316817Z self = 2025-05-07T20:33:19.9317015Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.9317019Z 2025-05-07T20:33:19.9317111Z @given( 2025-05-07T20:33:19.9317253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9317368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9317506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9317641Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9317776Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9317868Z ) 2025-05-07T20:33:19.9318153Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9318262Z def test_silu_mul_quant( 2025-05-07T20:33:19.9318361Z self, 2025-05-07T20:33:19.9318449Z T: int, 2025-05-07T20:33:19.9318537Z D: int, 2025-05-07T20:33:19.9318666Z scale_ub: Optional[float], 2025-05-07T20:33:19.9318775Z contiguous: bool, 2025-05-07T20:33:19.9318874Z compiled: bool, 2025-05-07T20:33:19.9318972Z ) -> None: 2025-05-07T20:33:19.9319080Z torch.manual_seed(2025) 2025-05-07T20:33:19.9319170Z 2025-05-07T20:33:19.9319366Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9319452Z 2025-05-07T20:33:19.9319563Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9319707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9319815Z x = x_sign * x_clamp 2025-05-07T20:33:19.9319915Z x0 = x[:, :D] 2025-05-07T20:33:19.9320008Z x1 = x[:, D:] 2025-05-07T20:33:19.9320091Z 2025-05-07T20:33:19.9320251Z if contiguous: 2025-05-07T20:33:19.9320359Z x0 = x0.contiguous() 2025-05-07T20:33:19.9320462Z x1 = x1.contiguous() 2025-05-07T20:33:19.9320553Z 2025-05-07T20:33:19.9320663Z if scale_ub is not None: 2025-05-07T20:33:19.9320791Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9320952Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9321041Z ) 2025-05-07T20:33:19.9321140Z else: 2025-05-07T20:33:19.9321249Z scale_ub_tensor = None 2025-05-07T20:33:19.9321335Z 2025-05-07T20:33:19.9321492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9321598Z op = silu_mul_quant 2025-05-07T20:33:19.9321696Z if compiled: 2025-05-07T20:33:19.9321868Z op = torch.compile(op) 2025-05-07T20:33:19.9321991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9322077Z 2025-05-07T20:33:19.9322193Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9322198Z 2025-05-07T20:33:19.9322310Z moe/activation_test.py:117: 2025-05-07T20:33:19.9322465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9322630Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9322789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9323216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9323324Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9324401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9324555Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9324968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9325234Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9325622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9325732Z kernel = self.compile( 2025-05-07T20:33:19.9326184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9326390Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9326536Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9326548Z 2025-05-07T20:33:19.9326787Z self = 2025-05-07T20:33:19.9327671Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9328266Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbce6290>} 2025-05-07T20:33:19.9329115Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9329351Z context = 2025-05-07T20:33:19.9329356Z 2025-05-07T20:33:19.9329546Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9329851Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9329986Z module_map=module_map) 2025-05-07T20:33:19.9330175Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9330292Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9330389Z E ^ 2025-05-07T20:33:19.9330989Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9330996Z 2025-05-07T20:33:19.9331478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9331485Z 2025-05-07T20:33:19.9331605Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9331859Z self=, 2025-05-07T20:33:19.9331958Z T=16384, 2025-05-07T20:33:19.9332048Z D=7168, 2025-05-07T20:33:19.9332151Z scale_ub=1200.0, 2025-05-07T20:33:19.9332248Z contiguous=True, 2025-05-07T20:33:19.9332348Z compiled=True, 2025-05-07T20:33:19.9332439Z ) 2025-05-07T20:33:19.9332762Z self = 2025-05-07T20:33:19.9332964Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.9332973Z 2025-05-07T20:33:19.9333068Z @given( 2025-05-07T20:33:19.9333206Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9333320Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9333533Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9333756Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9333920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9334007Z ) 2025-05-07T20:33:19.9334290Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9334408Z def test_silu_mul_quant( 2025-05-07T20:33:19.9334497Z self, 2025-05-07T20:33:19.9334585Z T: int, 2025-05-07T20:33:19.9334680Z D: int, 2025-05-07T20:33:19.9334801Z scale_ub: Optional[float], 2025-05-07T20:33:19.9334904Z contiguous: bool, 2025-05-07T20:33:19.9335014Z compiled: bool, 2025-05-07T20:33:19.9335106Z ) -> None: 2025-05-07T20:33:19.9335218Z torch.manual_seed(2025) 2025-05-07T20:33:19.9335309Z 2025-05-07T20:33:19.9335503Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9335596Z 2025-05-07T20:33:19.9335706Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9335855Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9335967Z x = x_sign * x_clamp 2025-05-07T20:33:19.9336060Z x0 = x[:, :D] 2025-05-07T20:33:19.9336152Z x1 = x[:, D:] 2025-05-07T20:33:19.9336243Z 2025-05-07T20:33:19.9336339Z if contiguous: 2025-05-07T20:33:19.9336446Z x0 = x0.contiguous() 2025-05-07T20:33:19.9336554Z x1 = x1.contiguous() 2025-05-07T20:33:19.9336640Z 2025-05-07T20:33:19.9336744Z if scale_ub is not None: 2025-05-07T20:33:19.9336875Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9337034Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9337120Z ) 2025-05-07T20:33:19.9337219Z else: 2025-05-07T20:33:19.9337328Z scale_ub_tensor = None 2025-05-07T20:33:19.9337418Z 2025-05-07T20:33:19.9337567Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9337676Z op = silu_mul_quant 2025-05-07T20:33:19.9337784Z if compiled: 2025-05-07T20:33:19.9337900Z op = torch.compile(op) 2025-05-07T20:33:19.9338023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9338114Z 2025-05-07T20:33:19.9338219Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9338224Z 2025-05-07T20:33:19.9338335Z moe/activation_test.py:117: 2025-05-07T20:33:19.9338495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9338612Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9338735Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9339210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9339319Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9339906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9340027Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9340436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9340698Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9341087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9341204Z kernel = self.compile( 2025-05-07T20:33:19.9341640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9341890Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9342049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9342054Z 2025-05-07T20:33:19.9342291Z self = 2025-05-07T20:33:19.9343300Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9343885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbce6d40>} 2025-05-07T20:33:19.9344730Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9344967Z context = 2025-05-07T20:33:19.9344972Z 2025-05-07T20:33:19.9345162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9345473Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9345605Z module_map=module_map) 2025-05-07T20:33:19.9345793Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9345915Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9346006Z E ^ 2025-05-07T20:33:19.9346420Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9346425Z 2025-05-07T20:33:19.9346896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9346903Z 2025-05-07T20:33:19.9347026Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9347291Z self=, 2025-05-07T20:33:19.9347382Z T=16384, 2025-05-07T20:33:19.9347470Z D=5120, 2025-05-07T20:33:19.9347575Z scale_ub=1200.0, 2025-05-07T20:33:19.9347676Z contiguous=True, 2025-05-07T20:33:19.9347785Z compiled=False, 2025-05-07T20:33:19.9347875Z ) 2025-05-07T20:33:19.9348124Z self = 2025-05-07T20:33:19.9348333Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9348338Z 2025-05-07T20:33:19.9348427Z @given( 2025-05-07T20:33:19.9348564Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9348690Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9348824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9348961Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9349101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9349240Z ) 2025-05-07T20:33:19.9349530Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9349639Z def test_silu_mul_quant( 2025-05-07T20:33:19.9349727Z self, 2025-05-07T20:33:19.9349823Z T: int, 2025-05-07T20:33:19.9349912Z D: int, 2025-05-07T20:33:19.9350028Z scale_ub: Optional[float], 2025-05-07T20:33:19.9350138Z contiguous: bool, 2025-05-07T20:33:19.9350241Z compiled: bool, 2025-05-07T20:33:19.9350334Z ) -> None: 2025-05-07T20:33:19.9350449Z torch.manual_seed(2025) 2025-05-07T20:33:19.9350536Z 2025-05-07T20:33:19.9350729Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9350822Z 2025-05-07T20:33:19.9350929Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9351130Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9351234Z x = x_sign * x_clamp 2025-05-07T20:33:19.9351327Z x0 = x[:, :D] 2025-05-07T20:33:19.9351431Z x1 = x[:, D:] 2025-05-07T20:33:19.9351515Z 2025-05-07T20:33:19.9351616Z if contiguous: 2025-05-07T20:33:19.9351728Z x0 = x0.contiguous() 2025-05-07T20:33:19.9351832Z x1 = x1.contiguous() 2025-05-07T20:33:19.9351962Z 2025-05-07T20:33:19.9352115Z if scale_ub is not None: 2025-05-07T20:33:19.9352238Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9352394Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9352488Z ) 2025-05-07T20:33:19.9352577Z else: 2025-05-07T20:33:19.9352687Z scale_ub_tensor = None 2025-05-07T20:33:19.9352779Z 2025-05-07T20:33:19.9352929Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9353038Z op = silu_mul_quant 2025-05-07T20:33:19.9353141Z if compiled: 2025-05-07T20:33:19.9353256Z op = torch.compile(op) 2025-05-07T20:33:19.9353387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9353471Z 2025-05-07T20:33:19.9353695Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9353700Z 2025-05-07T20:33:19.9353820Z moe/activation_test.py:117: 2025-05-07T20:33:19.9353975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9354121Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9354269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9354837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:19.9354958Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9355370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9355630Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9356031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9356139Z kernel = self.compile( 2025-05-07T20:33:19.9356583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9356793Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9356938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9356943Z 2025-05-07T20:33:19.9357186Z self = 2025-05-07T20:33:19.9358067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9358704Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbce7ac0>} 2025-05-07T20:33:19.9359550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9359779Z context = 2025-05-07T20:33:19.9359784Z 2025-05-07T20:33:19.9359984Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9360287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9360419Z module_map=module_map) 2025-05-07T20:33:19.9360605Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9360766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9360866Z E ^ 2025-05-07T20:33:19.9361275Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9361279Z 2025-05-07T20:33:19.9361747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9361806Z 2025-05-07T20:33:19.9361928Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9362226Z self=, 2025-05-07T20:33:19.9362327Z T=1, 2025-05-07T20:33:19.9362416Z D=7168, 2025-05-07T20:33:19.9362512Z scale_ub=1200.0, 2025-05-07T20:33:19.9362624Z contiguous=False, 2025-05-07T20:33:19.9362722Z compiled=False, 2025-05-07T20:33:19.9362811Z ) 2025-05-07T20:33:19.9363069Z self = 2025-05-07T20:33:19.9363262Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.9363271Z 2025-05-07T20:33:19.9363370Z @given( 2025-05-07T20:33:19.9363513Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9363652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9363810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9363953Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9364089Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9364185Z ) 2025-05-07T20:33:19.9364467Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9364577Z def test_silu_mul_quant( 2025-05-07T20:33:19.9364674Z self, 2025-05-07T20:33:19.9364764Z T: int, 2025-05-07T20:33:19.9364854Z D: int, 2025-05-07T20:33:19.9364976Z scale_ub: Optional[float], 2025-05-07T20:33:19.9365081Z contiguous: bool, 2025-05-07T20:33:19.9365188Z compiled: bool, 2025-05-07T20:33:19.9365283Z ) -> None: 2025-05-07T20:33:19.9365394Z torch.manual_seed(2025) 2025-05-07T20:33:19.9365485Z 2025-05-07T20:33:19.9365681Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9365767Z 2025-05-07T20:33:19.9365880Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9366023Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9366132Z x = x_sign * x_clamp 2025-05-07T20:33:19.9366237Z x0 = x[:, :D] 2025-05-07T20:33:19.9366329Z x1 = x[:, D:] 2025-05-07T20:33:19.9366413Z 2025-05-07T20:33:19.9366515Z if contiguous: 2025-05-07T20:33:19.9366621Z x0 = x0.contiguous() 2025-05-07T20:33:19.9366731Z x1 = x1.contiguous() 2025-05-07T20:33:19.9366815Z 2025-05-07T20:33:19.9366919Z if scale_ub is not None: 2025-05-07T20:33:19.9367047Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9367203Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9367294Z ) 2025-05-07T20:33:19.9367390Z else: 2025-05-07T20:33:19.9367498Z scale_ub_tensor = None 2025-05-07T20:33:19.9367638Z 2025-05-07T20:33:19.9367798Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9367903Z op = silu_mul_quant 2025-05-07T20:33:19.9368001Z if compiled: 2025-05-07T20:33:19.9368129Z op = torch.compile(op) 2025-05-07T20:33:19.9368253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9368344Z 2025-05-07T20:33:19.9368449Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9368454Z 2025-05-07T20:33:19.9368566Z moe/activation_test.py:117: 2025-05-07T20:33:19.9368718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9368834Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9368950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9369526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9369687Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9370104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9370360Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9370846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9370962Z kernel = self.compile( 2025-05-07T20:33:19.9371398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9371602Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9371753Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9371761Z 2025-05-07T20:33:19.9371997Z self = 2025-05-07T20:33:19.9372879Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9373463Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbbdc4c0>} 2025-05-07T20:33:19.9374369Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9374593Z context = 2025-05-07T20:33:19.9374598Z 2025-05-07T20:33:19.9374787Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9375095Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9375220Z module_map=module_map) 2025-05-07T20:33:19.9375406Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9375527Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9375616Z E ^ 2025-05-07T20:33:19.9376034Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9376039Z 2025-05-07T20:33:19.9376510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9376515Z 2025-05-07T20:33:19.9376635Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9376898Z self=, 2025-05-07T20:33:19.9376988Z T=4096, 2025-05-07T20:33:19.9377085Z D=7168, 2025-05-07T20:33:19.9377183Z scale_ub=1200.0, 2025-05-07T20:33:19.9377284Z contiguous=False, 2025-05-07T20:33:19.9377387Z compiled=True, 2025-05-07T20:33:19.9377472Z ) 2025-05-07T20:33:19.9377772Z self = 2025-05-07T20:33:19.9377983Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.9377991Z 2025-05-07T20:33:19.9378082Z @given( 2025-05-07T20:33:19.9378221Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9378345Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9378481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9378628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9378760Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9378847Z ) 2025-05-07T20:33:19.9379137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9379331Z def test_silu_mul_quant( 2025-05-07T20:33:19.9379422Z self, 2025-05-07T20:33:19.9379520Z T: int, 2025-05-07T20:33:19.9379611Z D: int, 2025-05-07T20:33:19.9379727Z scale_ub: Optional[float], 2025-05-07T20:33:19.9379837Z contiguous: bool, 2025-05-07T20:33:19.9379939Z compiled: bool, 2025-05-07T20:33:19.9380029Z ) -> None: 2025-05-07T20:33:19.9380194Z torch.manual_seed(2025) 2025-05-07T20:33:19.9380280Z 2025-05-07T20:33:19.9380524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9380611Z 2025-05-07T20:33:19.9380719Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9380872Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9380975Z x = x_sign * x_clamp 2025-05-07T20:33:19.9381068Z x0 = x[:, :D] 2025-05-07T20:33:19.9381167Z x1 = x[:, D:] 2025-05-07T20:33:19.9381252Z 2025-05-07T20:33:19.9381350Z if contiguous: 2025-05-07T20:33:19.9381469Z x0 = x0.contiguous() 2025-05-07T20:33:19.9381572Z x1 = x1.contiguous() 2025-05-07T20:33:19.9381657Z 2025-05-07T20:33:19.9381772Z if scale_ub is not None: 2025-05-07T20:33:19.9381893Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9382053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9382147Z ) 2025-05-07T20:33:19.9382239Z else: 2025-05-07T20:33:19.9382357Z scale_ub_tensor = None 2025-05-07T20:33:19.9382443Z 2025-05-07T20:33:19.9382592Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9382702Z op = silu_mul_quant 2025-05-07T20:33:19.9382800Z if compiled: 2025-05-07T20:33:19.9382915Z op = torch.compile(op) 2025-05-07T20:33:19.9383044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9383128Z 2025-05-07T20:33:19.9383232Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9383240Z 2025-05-07T20:33:19.9383361Z moe/activation_test.py:117: 2025-05-07T20:33:19.9383508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9383636Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9383752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9384173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9384292Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9384857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9384970Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9385383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9385639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9386038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9386147Z kernel = self.compile( 2025-05-07T20:33:19.9386637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9386847Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9386994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9387002Z 2025-05-07T20:33:19.9387244Z self = 2025-05-07T20:33:19.9388126Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9388707Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbbdd1b0>} 2025-05-07T20:33:19.9389604Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9389825Z context = 2025-05-07T20:33:19.9389872Z 2025-05-07T20:33:19.9390107Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9390410Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9390534Z module_map=module_map) 2025-05-07T20:33:19.9390726Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9390840Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9390929Z E ^ 2025-05-07T20:33:19.9391340Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9391348Z 2025-05-07T20:33:19.9391821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9391826Z 2025-05-07T20:33:19.9391953Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9392207Z self=, 2025-05-07T20:33:19.9392300Z T=128, 2025-05-07T20:33:19.9392399Z D=7168, 2025-05-07T20:33:19.9392498Z scale_ub=1200.0, 2025-05-07T20:33:19.9392605Z contiguous=False, 2025-05-07T20:33:19.9392702Z compiled=True, 2025-05-07T20:33:19.9392787Z ) 2025-05-07T20:33:19.9393041Z self = 2025-05-07T20:33:19.9393240Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.9393245Z 2025-05-07T20:33:19.9393334Z @given( 2025-05-07T20:33:19.9393479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9393701Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9393837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9393981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9394119Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9394213Z ) 2025-05-07T20:33:19.9394504Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9394615Z def test_silu_mul_quant( 2025-05-07T20:33:19.9394714Z self, 2025-05-07T20:33:19.9394803Z T: int, 2025-05-07T20:33:19.9394892Z D: int, 2025-05-07T20:33:19.9395014Z scale_ub: Optional[float], 2025-05-07T20:33:19.9395120Z contiguous: bool, 2025-05-07T20:33:19.9395221Z compiled: bool, 2025-05-07T20:33:19.9395321Z ) -> None: 2025-05-07T20:33:19.9395431Z torch.manual_seed(2025) 2025-05-07T20:33:19.9395519Z 2025-05-07T20:33:19.9395721Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9395807Z 2025-05-07T20:33:19.9395968Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9396125Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9396228Z x = x_sign * x_clamp 2025-05-07T20:33:19.9396328Z x0 = x[:, :D] 2025-05-07T20:33:19.9396424Z x1 = x[:, D:] 2025-05-07T20:33:19.9396509Z 2025-05-07T20:33:19.9396615Z if contiguous: 2025-05-07T20:33:19.9396720Z x0 = x0.contiguous() 2025-05-07T20:33:19.9396823Z x1 = x1.contiguous() 2025-05-07T20:33:19.9396915Z 2025-05-07T20:33:19.9397020Z if scale_ub is not None: 2025-05-07T20:33:19.9397142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9397306Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9397398Z ) 2025-05-07T20:33:19.9397486Z else: 2025-05-07T20:33:19.9397646Z scale_ub_tensor = None 2025-05-07T20:33:19.9397732Z 2025-05-07T20:33:19.9397893Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9398001Z op = silu_mul_quant 2025-05-07T20:33:19.9398103Z if compiled: 2025-05-07T20:33:19.9398226Z op = torch.compile(op) 2025-05-07T20:33:19.9398348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9398478Z 2025-05-07T20:33:19.9398630Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9398636Z 2025-05-07T20:33:19.9398752Z moe/activation_test.py:117: 2025-05-07T20:33:19.9398898Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9399020Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9399135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9399558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9399671Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbbdc0d0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    (same Triton compile traceback and CompilationError as above: fp8e4nv not supported in this architecture)

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
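Both of these compilation failures are the same architecture mismatch rather than a kernel bug: Triton's fp8e4nv corresponds to float8_e4m3fn, which the NVIDIA backend only accepts on compute capability 8.9 or newer, while the 22 GiB GPU on this runner only advertises fp8e4b15 and fp8e5 (consistent with an SM 8.6 part such as an A10G). A minimal guard of the kind such suites use, sketched here with a hypothetical SM89_OR_LATER helper (not FBGEMM's actual guard):

    # Hedged sketch: skip FP8-e4m3 kernel tests on pre-SM89 GPUs.
    # SM89_OR_LATER is an assumed name, not FBGEMM's real helper.
    import unittest

    import torch

    SM89_OR_LATER = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(
        SM89_OR_LATER,
        "fp8e4nv (float8_e4m3fn) requires compute capability 8.9+",
    )
    class SiluMulQuantFP8Test(unittest.TestCase):
        ...

With a guard like this, the examples below would be skipped up front instead of failing one by one at Triton compile time.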
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self =
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    (same test body as above)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (torch.clamp; tried to allocate 112.00 MiB with only 28.44 MiB free)
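The out-of-memory half of the run points at accumulation across Hypothesis examples rather than a single oversized tensor: over the rest of the run, requests as small as 40.00 MiB fail while reported free memory sits between 140.44 MiB and 26.44 MiB and PyTorch already holds roughly 21.7 GiB, and the error text itself suggests the allocator's expandable-segments mode. A sketch of how that suggestion would be applied, assuming control of the test process (the variable must be set before the first CUDA allocation):

    # Hedged sketch, not part of this workflow: enable expandable segments
    # to curb fragmentation, and release cached blocks between examples.
    import gc
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so the CUDA allocator honors it

    def release_cuda_memory() -> None:
        gc.collect()               # drop Python references to dead tensors first
        torch.cuda.empty_cache()   # then return cached, unused blocks to the driver

Calling release_cuda_memory() at the start of each example would keep one failed or leaky example from starving the ones after it, though the underlying question is what is pinning ~21.7 GiB between examples in the first place.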
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9441583Z 2025-05-07T20:33:19.9441723Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.9441727Z 2025-05-07T20:33:19.9441845Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9442105Z self=, 2025-05-07T20:33:19.9442239Z T=16384, 2025-05-07T20:33:19.9442329Z D=7168, 2025-05-07T20:33:19.9442433Z scale_ub=None, 2025-05-07T20:33:19.9442532Z contiguous=False, 2025-05-07T20:33:19.9442634Z compiled=False, 2025-05-07T20:33:19.9442721Z ) 2025-05-07T20:33:19.9442965Z self = 2025-05-07T20:33:19.9443171Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.9443221Z 2025-05-07T20:33:19.9443316Z @given( 2025-05-07T20:33:19.9443495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9443616Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9443747Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9443881Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9444017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9444105Z ) 2025-05-07T20:33:19.9444390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9444502Z def test_silu_mul_quant( 2025-05-07T20:33:19.9444590Z self, 2025-05-07T20:33:19.9444684Z T: int, 2025-05-07T20:33:19.9444775Z D: int, 2025-05-07T20:33:19.9444888Z scale_ub: Optional[float], 2025-05-07T20:33:19.9444999Z contiguous: bool, 2025-05-07T20:33:19.9445097Z compiled: bool, 2025-05-07T20:33:19.9445187Z ) -> None: 2025-05-07T20:33:19.9445304Z torch.manual_seed(2025) 2025-05-07T20:33:19.9445392Z 2025-05-07T20:33:19.9445584Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9447612Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9447622Z 2025-05-07T20:33:19.9447762Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9447767Z 2025-05-07T20:33:19.9447885Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9448140Z self=, 2025-05-07T20:33:19.9448237Z T=2048, 2025-05-07T20:33:19.9448325Z D=7168, 2025-05-07T20:33:19.9448419Z scale_ub=1200.0, 2025-05-07T20:33:19.9448522Z contiguous=True, 2025-05-07T20:33:19.9448619Z compiled=True, 2025-05-07T20:33:19.9448702Z ) 2025-05-07T20:33:19.9448952Z self = 2025-05-07T20:33:19.9449145Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.9449342Z 2025-05-07T20:33:19.9449435Z @given( 2025-05-07T20:33:19.9449569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9449682Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9449874Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9450009Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9450139Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9450234Z ) 2025-05-07T20:33:19.9450516Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9450623Z def test_silu_mul_quant( 2025-05-07T20:33:19.9450717Z self, 2025-05-07T20:33:19.9450805Z T: int, 2025-05-07T20:33:19.9450900Z D: int, 2025-05-07T20:33:19.9451013Z scale_ub: Optional[float], 2025-05-07T20:33:19.9451115Z contiguous: bool, 2025-05-07T20:33:19.9451219Z compiled: bool, 2025-05-07T20:33:19.9451309Z ) -> None: 2025-05-07T20:33:19.9451467Z torch.manual_seed(2025) 2025-05-07T20:33:19.9451556Z 2025-05-07T20:33:19.9451751Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9451838Z 2025-05-07T20:33:19.9451955Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9452098Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9454441Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9454503Z 2025-05-07T20:33:19.9454672Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.9454682Z 2025-05-07T20:33:19.9454838Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9455155Z self=, 2025-05-07T20:33:19.9455266Z T=2048, 2025-05-07T20:33:19.9455384Z D=7168, 2025-05-07T20:33:19.9455504Z scale_ub=None, 2025-05-07T20:33:19.9455627Z contiguous=True, 2025-05-07T20:33:19.9455762Z compiled=False, 2025-05-07T20:33:19.9455865Z ) 2025-05-07T20:33:19.9456113Z self = 2025-05-07T20:33:19.9456315Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9456320Z 2025-05-07T20:33:19.9456408Z @given( 2025-05-07T20:33:19.9456549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9456662Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9456793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9456935Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9457066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9457151Z ) 2025-05-07T20:33:19.9457442Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9457554Z def test_silu_mul_quant( 2025-05-07T20:33:19.9457646Z self, 2025-05-07T20:33:19.9457742Z T: int, 2025-05-07T20:33:19.9457834Z D: int, 2025-05-07T20:33:19.9457951Z scale_ub: Optional[float], 2025-05-07T20:33:19.9458062Z contiguous: bool, 2025-05-07T20:33:19.9458160Z compiled: bool, 2025-05-07T20:33:19.9458255Z ) -> None: 2025-05-07T20:33:19.9458365Z torch.manual_seed(2025) 2025-05-07T20:33:19.9458449Z 2025-05-07T20:33:19.9458648Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9458736Z 2025-05-07T20:33:19.9458843Z > x_sign = torch.sign(x) 2025-05-07T20:33:19.9460899Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9460911Z 2025-05-07T20:33:19.9461048Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:19.9461053Z 2025-05-07T20:33:19.9461177Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9461427Z self=, 2025-05-07T20:33:19.9461515Z T=1, 2025-05-07T20:33:19.9461610Z D=7168, 2025-05-07T20:33:19.9461705Z scale_ub=1200.0, 2025-05-07T20:33:19.9461857Z contiguous=True, 2025-05-07T20:33:19.9461954Z compiled=False, 2025-05-07T20:33:19.9462038Z ) 2025-05-07T20:33:19.9462292Z self = 2025-05-07T20:33:19.9462480Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9462485Z 2025-05-07T20:33:19.9462574Z @given( 2025-05-07T20:33:19.9462716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9462917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9463055Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9463227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9463388Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9463499Z ) 2025-05-07T20:33:19.9463849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9463982Z def test_silu_mul_quant( 2025-05-07T20:33:19.9464098Z self, 2025-05-07T20:33:19.9464213Z T: int, 2025-05-07T20:33:19.9464322Z D: int, 2025-05-07T20:33:19.9464470Z scale_ub: Optional[float], 2025-05-07T20:33:19.9464601Z contiguous: bool, 2025-05-07T20:33:19.9464723Z compiled: bool, 2025-05-07T20:33:19.9464841Z ) -> None: 2025-05-07T20:33:19.9464976Z torch.manual_seed(2025) 2025-05-07T20:33:19.9465079Z 2025-05-07T20:33:19.9465299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9465388Z 2025-05-07T20:33:19.9465495Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9465644Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9465746Z x = x_sign * x_clamp 2025-05-07T20:33:19.9465847Z x0 = x[:, :D] 2025-05-07T20:33:19.9465940Z x1 = x[:, D:] 2025-05-07T20:33:19.9466023Z 2025-05-07T20:33:19.9466126Z if contiguous: 2025-05-07T20:33:19.9466232Z x0 = x0.contiguous() 2025-05-07T20:33:19.9466342Z x1 = x1.contiguous() 2025-05-07T20:33:19.9466431Z 2025-05-07T20:33:19.9466536Z if scale_ub is not None: 2025-05-07T20:33:19.9466659Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9466826Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9466915Z ) 2025-05-07T20:33:19.9467006Z else: 2025-05-07T20:33:19.9467121Z scale_ub_tensor = None 2025-05-07T20:33:19.9467209Z 2025-05-07T20:33:19.9467366Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9467470Z op = silu_mul_quant 2025-05-07T20:33:19.9467570Z if compiled: 2025-05-07T20:33:19.9467692Z op = torch.compile(op) 2025-05-07T20:33:19.9467813Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9467898Z 2025-05-07T20:33:19.9468008Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9468013Z 2025-05-07T20:33:19.9468125Z moe/activation_test.py:117: 2025-05-07T20:33:19.9468276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9468400Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9468574Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9469152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9469265Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9469679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9469942Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9470327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9470439Z kernel = self.compile( 2025-05-07T20:33:19.9470884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9471130Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9471285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9471290Z 2025-05-07T20:33:19.9471526Z self = 2025-05-07T20:33:19.9472443Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9473072Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d44c0>} 2025-05-07T20:33:19.9474028Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9474267Z context = 2025-05-07T20:33:19.9474272Z 2025-05-07T20:33:19.9474464Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9474770Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9474900Z module_map=module_map) 2025-05-07T20:33:19.9475089Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9475209Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9475299Z E ^ 2025-05-07T20:33:19.9475701Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9475707Z 2025-05-07T20:33:19.9476179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9476187Z 2025-05-07T20:33:19.9476306Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9476565Z self=, 2025-05-07T20:33:19.9476657Z T=128, 2025-05-07T20:33:19.9476750Z D=5120, 2025-05-07T20:33:19.9476854Z scale_ub=None, 2025-05-07T20:33:19.9476953Z contiguous=True, 2025-05-07T20:33:19.9477050Z compiled=False, 2025-05-07T20:33:19.9477149Z ) 2025-05-07T20:33:19.9477397Z self = 2025-05-07T20:33:19.9477591Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9477606Z 2025-05-07T20:33:19.9477696Z @given( 2025-05-07T20:33:19.9477830Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9477950Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9478081Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9478216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9478353Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9478440Z ) 2025-05-07T20:33:19.9478771Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9478889Z def test_silu_mul_quant( 2025-05-07T20:33:19.9478978Z self, 2025-05-07T20:33:19.9479066Z T: int, 2025-05-07T20:33:19.9479161Z D: int, 2025-05-07T20:33:19.9479277Z scale_ub: Optional[float], 2025-05-07T20:33:19.9479388Z contiguous: bool, 2025-05-07T20:33:19.9479487Z compiled: bool, 2025-05-07T20:33:19.9479577Z ) -> None: 2025-05-07T20:33:19.9479692Z torch.manual_seed(2025) 2025-05-07T20:33:19.9479775Z 2025-05-07T20:33:19.9479967Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9480058Z 2025-05-07T20:33:19.9480167Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9480309Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9480468Z x = x_sign * x_clamp 2025-05-07T20:33:19.9480560Z x0 = x[:, :D] 2025-05-07T20:33:19.9480652Z x1 = x[:, D:] 2025-05-07T20:33:19.9480741Z 2025-05-07T20:33:19.9480840Z if contiguous: 2025-05-07T20:33:19.9480951Z x0 = x0.contiguous() 2025-05-07T20:33:19.9481058Z x1 = x1.contiguous() 2025-05-07T20:33:19.9481142Z 2025-05-07T20:33:19.9481301Z if scale_ub is not None: 2025-05-07T20:33:19.9481464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9481621Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9481714Z ) 2025-05-07T20:33:19.9481803Z else: 2025-05-07T20:33:19.9481911Z scale_ub_tensor = None 2025-05-07T20:33:19.9482000Z 2025-05-07T20:33:19.9482147Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9482251Z op = silu_mul_quant 2025-05-07T20:33:19.9482354Z if compiled: 2025-05-07T20:33:19.9482473Z op = torch.compile(op) 2025-05-07T20:33:19.9482593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9482682Z 2025-05-07T20:33:19.9482789Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9482794Z 2025-05-07T20:33:19.9482913Z moe/activation_test.py:117: 2025-05-07T20:33:19.9483060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9483179Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9483303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9483869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9483981Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9484397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9484651Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9485043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9485153Z kernel = self.compile( 2025-05-07T20:33:19.9485586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9485793Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9485942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9485947Z 2025-05-07T20:33:19.9486187Z self = 2025-05-07T20:33:19.9487076Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9487653Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d4940>} 2025-05-07T20:33:19.9488554Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9488776Z context = 2025-05-07T20:33:19.9488784Z 2025-05-07T20:33:19.9488982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9489287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9489410Z module_map=module_map) 2025-05-07T20:33:19.9489603Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9489719Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9489815Z E ^ 2025-05-07T20:33:19.9490299Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9490304Z 2025-05-07T20:33:19.9490773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9490778Z 2025-05-07T20:33:19.9490904Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9491242Z self=, 2025-05-07T20:33:19.9491340Z T=128, 2025-05-07T20:33:19.9491429Z D=7168, 2025-05-07T20:33:19.9491523Z scale_ub=None, 2025-05-07T20:33:19.9491628Z contiguous=True, 2025-05-07T20:33:19.9491725Z compiled=False, 2025-05-07T20:33:19.9491810Z ) 2025-05-07T20:33:19.9492061Z self = 2025-05-07T20:33:19.9492255Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9492264Z 2025-05-07T20:33:19.9492352Z @given( 2025-05-07T20:33:19.9492494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9492611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9492745Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9492888Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9493019Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9493115Z ) 2025-05-07T20:33:19.9493397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9493505Z def test_silu_mul_quant( 2025-05-07T20:33:19.9493603Z self, 2025-05-07T20:33:19.9493691Z T: int, 2025-05-07T20:33:19.9493780Z D: int, 2025-05-07T20:33:19.9493902Z scale_ub: Optional[float], 2025-05-07T20:33:19.9494010Z contiguous: bool, 2025-05-07T20:33:19.9494108Z compiled: bool, 2025-05-07T20:33:19.9494205Z ) -> None: 2025-05-07T20:33:19.9494316Z torch.manual_seed(2025) 2025-05-07T20:33:19.9494399Z 2025-05-07T20:33:19.9494598Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9494683Z 2025-05-07T20:33:19.9494796Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9494941Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9495043Z x = x_sign * x_clamp 2025-05-07T20:33:19.9495141Z x0 = x[:, :D] 2025-05-07T20:33:19.9495236Z x1 = x[:, D:] 2025-05-07T20:33:19.9495322Z 2025-05-07T20:33:19.9495426Z if contiguous: 2025-05-07T20:33:19.9495535Z x0 = x0.contiguous() 2025-05-07T20:33:19.9495637Z x1 = x1.contiguous() 2025-05-07T20:33:19.9495728Z 2025-05-07T20:33:19.9495832Z if scale_ub is not None: 2025-05-07T20:33:19.9495952Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9496116Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9496204Z ) 2025-05-07T20:33:19.9496300Z else: 2025-05-07T20:33:19.9496408Z scale_ub_tensor = None 2025-05-07T20:33:19.9496492Z 2025-05-07T20:33:19.9496697Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9496804Z op = silu_mul_quant 2025-05-07T20:33:19.9496901Z if compiled: 2025-05-07T20:33:19.9497022Z op = torch.compile(op) 2025-05-07T20:33:19.9497143Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9497229Z 2025-05-07T20:33:19.9497344Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9497349Z 2025-05-07T20:33:19.9497460Z moe/activation_test.py:117: 2025-05-07T20:33:19.9497611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9497728Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9497845Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9498420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9498581Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9498989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9499247Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9499636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9499836Z kernel = self.compile( 2025-05-07T20:33:19.9500270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9500469Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9500622Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9500627Z 2025-05-07T20:33:19.9500865Z self = 2025-05-07T20:33:19.9501749Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9502324Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d5240>} 2025-05-07T20:33:19.9503169Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9503397Z context = 2025-05-07T20:33:19.9503402Z 2025-05-07T20:33:19.9503591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9503897Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9504024Z module_map=module_map) 2025-05-07T20:33:19.9504211Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9504331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9504423Z E ^ 2025-05-07T20:33:19.9504824Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9504838Z 2025-05-07T20:33:19.9505306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9505311Z 2025-05-07T20:33:19.9505429Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9505690Z self=, 2025-05-07T20:33:19.9505779Z T=2048, 2025-05-07T20:33:19.9505868Z D=7168, 2025-05-07T20:33:19.9505971Z scale_ub=1200.0, 2025-05-07T20:33:19.9506073Z contiguous=True, 2025-05-07T20:33:19.9506170Z compiled=False, 2025-05-07T20:33:19.9506264Z ) 2025-05-07T20:33:19.9506559Z self = 2025-05-07T20:33:19.9506766Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9506771Z 2025-05-07T20:33:19.9506859Z @given( 2025-05-07T20:33:19.9506993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9507123Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9507255Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9507390Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9507530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9507619Z ) 2025-05-07T20:33:19.9507900Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9508016Z def test_silu_mul_quant( 2025-05-07T20:33:19.9508151Z self, 2025-05-07T20:33:19.9508246Z T: int, 2025-05-07T20:33:19.9508336Z D: int, 2025-05-07T20:33:19.9508451Z scale_ub: Optional[float], 2025-05-07T20:33:19.9508564Z contiguous: bool, 2025-05-07T20:33:19.9508668Z compiled: bool, 2025-05-07T20:33:19.9508761Z ) -> None: 2025-05-07T20:33:19.9508876Z torch.manual_seed(2025) 2025-05-07T20:33:19.9508961Z 2025-05-07T20:33:19.9509197Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9511247Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9511257Z 2025-05-07T20:33:19.9511393Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9511401Z 2025-05-07T20:33:19.9511523Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9511774Z self=, 2025-05-07T20:33:19.9511875Z T=1, 2025-05-07T20:33:19.9511965Z D=5120, 2025-05-07T20:33:19.9512064Z scale_ub=1200.0, 2025-05-07T20:33:19.9512174Z contiguous=True, 2025-05-07T20:33:19.9512272Z compiled=False, 2025-05-07T20:33:19.9512358Z ) 2025-05-07T20:33:19.9512608Z self = 2025-05-07T20:33:19.9512796Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9512801Z 2025-05-07T20:33:19.9512888Z @given( 2025-05-07T20:33:19.9513032Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9513180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9513348Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9513638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9513803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9513920Z ) 2025-05-07T20:33:19.9514266Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9514404Z def test_silu_mul_quant( 2025-05-07T20:33:19.9514522Z self, 2025-05-07T20:33:19.9514632Z T: int, 2025-05-07T20:33:19.9514743Z D: int, 2025-05-07T20:33:19.9514889Z scale_ub: Optional[float], 2025-05-07T20:33:19.9515017Z contiguous: bool, 2025-05-07T20:33:19.9515138Z compiled: bool, 2025-05-07T20:33:19.9515242Z ) -> None: 2025-05-07T20:33:19.9515350Z torch.manual_seed(2025) 2025-05-07T20:33:19.9515442Z 2025-05-07T20:33:19.9515634Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9515723Z 2025-05-07T20:33:19.9515833Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9516030Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9516136Z x = x_sign * x_clamp 2025-05-07T20:33:19.9516235Z x0 = x[:, :D] 2025-05-07T20:33:19.9516328Z x1 = x[:, D:] 2025-05-07T20:33:19.9516410Z 2025-05-07T20:33:19.9516514Z if contiguous: 2025-05-07T20:33:19.9516619Z x0 = x0.contiguous() 2025-05-07T20:33:19.9516727Z x1 = x1.contiguous() 2025-05-07T20:33:19.9516817Z 2025-05-07T20:33:19.9516922Z if scale_ub is not None: 2025-05-07T20:33:19.9517042Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9517204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9517292Z ) 2025-05-07T20:33:19.9517385Z else: 2025-05-07T20:33:19.9517493Z scale_ub_tensor = None 2025-05-07T20:33:19.9517625Z 2025-05-07T20:33:19.9517780Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9517884Z op = silu_mul_quant 2025-05-07T20:33:19.9517981Z if compiled: 2025-05-07T20:33:19.9518107Z op = torch.compile(op) 2025-05-07T20:33:19.9518234Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9518317Z 2025-05-07T20:33:19.9518426Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9518478Z 2025-05-07T20:33:19.9518631Z moe/activation_test.py:117: 2025-05-07T20:33:19.9518787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9518903Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9519021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9519595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9519707Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9520118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9520380Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9520767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9520884Z kernel = self.compile( 2025-05-07T20:33:19.9521325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9521526Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9521678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9521683Z 2025-05-07T20:33:19.9521916Z self = 2025-05-07T20:33:19.9522798Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9523386Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d6200>} 2025-05-07T20:33:19.9524832Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9525081Z context = 2025-05-07T20:33:19.9525087Z 2025-05-07T20:33:19.9525280Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9525585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9525707Z module_map=module_map) 2025-05-07T20:33:19.9525897Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9526021Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9526112Z E ^ 2025-05-07T20:33:19.9526728Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9526742Z 2025-05-07T20:33:19.9527211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9527218Z 2025-05-07T20:33:19.9527338Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9527595Z self=, 2025-05-07T20:33:19.9527683Z T=2048, 2025-05-07T20:33:19.9527772Z D=5120, 2025-05-07T20:33:19.9527875Z scale_ub=None, 2025-05-07T20:33:19.9527973Z contiguous=True, 2025-05-07T20:33:19.9528073Z compiled=False, 2025-05-07T20:33:19.9528164Z ) 2025-05-07T20:33:19.9528479Z self = 2025-05-07T20:33:19.9528681Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9528689Z 2025-05-07T20:33:19.9528775Z @given( 2025-05-07T20:33:19.9528909Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9529032Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9529235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9529438Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9529577Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9529662Z ) 2025-05-07T20:33:19.9529944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9530052Z def test_silu_mul_quant( 2025-05-07T20:33:19.9530139Z self, 2025-05-07T20:33:19.9530232Z T: int, 2025-05-07T20:33:19.9530319Z D: int, 2025-05-07T20:33:19.9530435Z scale_ub: Optional[float], 2025-05-07T20:33:19.9530543Z contiguous: bool, 2025-05-07T20:33:19.9530640Z compiled: bool, 2025-05-07T20:33:19.9530728Z ) -> None: 2025-05-07T20:33:19.9530847Z torch.manual_seed(2025) 2025-05-07T20:33:19.9530930Z 2025-05-07T20:33:19.9531121Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9531210Z 2025-05-07T20:33:19.9531320Z > x_sign = torch.sign(x) 2025-05-07T20:33:19.9533308Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9533317Z 2025-05-07T20:33:19.9533451Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:19.9533456Z 2025-05-07T20:33:19.9533583Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9533833Z self=, 2025-05-07T20:33:19.9533922Z T=16384, 2025-05-07T20:33:19.9534017Z D=5120, 2025-05-07T20:33:19.9534111Z scale_ub=None, 2025-05-07T20:33:19.9534211Z contiguous=True, 2025-05-07T20:33:19.9534312Z compiled=False, 2025-05-07T20:33:19.9534396Z ) 2025-05-07T20:33:19.9534640Z self = 2025-05-07T20:33:19.9534845Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9534850Z 2025-05-07T20:33:19.9534937Z @given( 2025-05-07T20:33:19.9535077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9535192Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9535321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9535463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9535642Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9535728Z ) 2025-05-07T20:33:19.9536012Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9536121Z def test_silu_mul_quant( 2025-05-07T20:33:19.9536208Z self, 2025-05-07T20:33:19.9536306Z T: int, 2025-05-07T20:33:19.9536393Z D: int, 2025-05-07T20:33:19.9536505Z scale_ub: Optional[float], 2025-05-07T20:33:19.9536613Z contiguous: bool, 2025-05-07T20:33:19.9536710Z compiled: bool, 2025-05-07T20:33:19.9536805Z ) -> None: 2025-05-07T20:33:19.9536912Z torch.manual_seed(2025) 2025-05-07T20:33:19.9536995Z 2025-05-07T20:33:19.9537191Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9539262Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9539308Z 2025-05-07T20:33:19.9539448Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9539453Z 2025-05-07T20:33:19.9539571Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9539821Z self=, 2025-05-07T20:33:19.9539918Z T=4096, 2025-05-07T20:33:19.9540007Z D=5120, 2025-05-07T20:33:19.9540103Z scale_ub=None, 2025-05-07T20:33:19.9540209Z contiguous=True, 2025-05-07T20:33:19.9540306Z compiled=False, 2025-05-07T20:33:19.9540396Z ) 2025-05-07T20:33:19.9540646Z self = 2025-05-07T20:33:19.9540840Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9540844Z 2025-05-07T20:33:19.9540939Z @given( 2025-05-07T20:33:19.9541078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9541197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9541334Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9541466Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9541595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9541689Z ) 2025-05-07T20:33:19.9541966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9542081Z def test_silu_mul_quant( 2025-05-07T20:33:19.9542172Z self, 2025-05-07T20:33:19.9542260Z T: int, 2025-05-07T20:33:19.9542355Z D: int, 2025-05-07T20:33:19.9542468Z scale_ub: Optional[float], 2025-05-07T20:33:19.9542576Z contiguous: bool, 2025-05-07T20:33:19.9542681Z compiled: bool, 2025-05-07T20:33:19.9542771Z ) -> None: 2025-05-07T20:33:19.9542879Z torch.manual_seed(2025) 2025-05-07T20:33:19.9542973Z 2025-05-07T20:33:19.9543165Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9545162Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9545171Z 2025-05-07T20:33:19.9545352Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9545357Z 2025-05-07T20:33:19.9545481Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9545732Z self=, 2025-05-07T20:33:19.9545823Z T=2048, 2025-05-07T20:33:19.9545919Z D=5120, 2025-05-07T20:33:19.9546018Z scale_ub=None, 2025-05-07T20:33:19.9546120Z contiguous=False, 2025-05-07T20:33:19.9546224Z compiled=False, 2025-05-07T20:33:19.9546308Z ) 2025-05-07T20:33:19.9546551Z self = 2025-05-07T20:33:19.9546755Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.9546760Z 2025-05-07T20:33:19.9546848Z @given( 2025-05-07T20:33:19.9546988Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9547148Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9547279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9547421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9547551Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9547641Z ) 2025-05-07T20:33:19.9547929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9548123Z def test_silu_mul_quant( 2025-05-07T20:33:19.9548213Z self, 2025-05-07T20:33:19.9548307Z T: int, 2025-05-07T20:33:19.9548394Z D: int, 2025-05-07T20:33:19.9548508Z scale_ub: Optional[float], 2025-05-07T20:33:19.9548615Z contiguous: bool, 2025-05-07T20:33:19.9548712Z compiled: bool, 2025-05-07T20:33:19.9548806Z ) -> None: 2025-05-07T20:33:19.9548914Z torch.manual_seed(2025) 2025-05-07T20:33:19.9549000Z 2025-05-07T20:33:19.9549203Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9551191Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9551201Z 2025-05-07T20:33:19.9551341Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9551346Z 2025-05-07T20:33:19.9551463Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9551711Z self=, 2025-05-07T20:33:19.9551809Z T=4096, 2025-05-07T20:33:19.9551897Z D=7168, 2025-05-07T20:33:19.9551991Z scale_ub=None, 2025-05-07T20:33:19.9552099Z contiguous=True, 2025-05-07T20:33:19.9552194Z compiled=True, 2025-05-07T20:33:19.9552285Z ) 2025-05-07T20:33:19.9552528Z self = 2025-05-07T20:33:19.9552719Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.9552726Z 2025-05-07T20:33:19.9552819Z @given( 2025-05-07T20:33:19.9552956Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9553068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9553202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9553335Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9553466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9553645Z ) 2025-05-07T20:33:19.9553923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9554040Z def test_silu_mul_quant( 2025-05-07T20:33:19.9554130Z self, 2025-05-07T20:33:19.9554218Z T: int, 2025-05-07T20:33:19.9554391Z D: int, 2025-05-07T20:33:19.9554505Z scale_ub: Optional[float], 2025-05-07T20:33:19.9554620Z contiguous: bool, 2025-05-07T20:33:19.9554735Z compiled: bool, 2025-05-07T20:33:19.9554847Z ) -> None: 2025-05-07T20:33:19.9554962Z torch.manual_seed(2025) 2025-05-07T20:33:19.9555051Z 2025-05-07T20:33:19.9555245Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9557263Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9557319Z 2025-05-07T20:33:19.9557455Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9557459Z 2025-05-07T20:33:19.9557584Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9557886Z self=, 2025-05-07T20:33:19.9558041Z T=2048, 2025-05-07T20:33:19.9558136Z D=5120, 2025-05-07T20:33:19.9558232Z scale_ub=1200.0, 2025-05-07T20:33:19.9558332Z contiguous=False, 2025-05-07T20:33:19.9558436Z compiled=False, 2025-05-07T20:33:19.9558520Z ) 2025-05-07T20:33:19.9558763Z self = 2025-05-07T20:33:19.9558967Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.9558975Z 2025-05-07T20:33:19.9559061Z @given( 2025-05-07T20:33:19.9559201Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9564667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9564833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9564982Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9565115Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9565208Z ) 2025-05-07T20:33:19.9565505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9565615Z def test_silu_mul_quant( 2025-05-07T20:33:19.9565706Z self, 2025-05-07T20:33:19.9565805Z T: int, 2025-05-07T20:33:19.9565893Z D: int, 2025-05-07T20:33:19.9566007Z scale_ub: Optional[float], 2025-05-07T20:33:19.9566116Z contiguous: bool, 2025-05-07T20:33:19.9566215Z compiled: bool, 2025-05-07T20:33:19.9566315Z ) -> None: 2025-05-07T20:33:19.9566425Z torch.manual_seed(2025) 2025-05-07T20:33:19.9566512Z 2025-05-07T20:33:19.9566716Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9568786Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9568796Z 2025-05-07T20:33:19.9568941Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9568946Z 2025-05-07T20:33:19.9569064Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9569319Z self=, 2025-05-07T20:33:19.9569420Z T=4096, 2025-05-07T20:33:19.9569510Z D=7168, 2025-05-07T20:33:19.9569607Z scale_ub=1200.0, 2025-05-07T20:33:19.9569792Z contiguous=True, 2025-05-07T20:33:19.9569896Z compiled=False, 2025-05-07T20:33:19.9569988Z ) 2025-05-07T20:33:19.9570238Z self = 2025-05-07T20:33:19.9570435Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9570443Z 2025-05-07T20:33:19.9570536Z @given( 2025-05-07T20:33:19.9570672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9570787Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9570924Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9571056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9571186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9571276Z ) 2025-05-07T20:33:19.9571613Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9571728Z def test_silu_mul_quant( 2025-05-07T20:33:19.9571816Z self, 2025-05-07T20:33:19.9571908Z T: int, 2025-05-07T20:33:19.9572001Z D: int, 2025-05-07T20:33:19.9572114Z scale_ub: Optional[float], 2025-05-07T20:33:19.9572218Z contiguous: bool, 2025-05-07T20:33:19.9572371Z compiled: bool, 2025-05-07T20:33:19.9572464Z ) -> None: 2025-05-07T20:33:19.9572615Z torch.manual_seed(2025) 2025-05-07T20:33:19.9572706Z 2025-05-07T20:33:19.9572900Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9574906Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9574916Z 2025-05-07T20:33:19.9575050Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9575056Z 2025-05-07T20:33:19.9575184Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9575439Z self=, 2025-05-07T20:33:19.9575529Z T=16384, 2025-05-07T20:33:19.9575624Z D=7168, 2025-05-07T20:33:19.9575720Z scale_ub=None, 2025-05-07T20:33:19.9575820Z contiguous=False, 2025-05-07T20:33:19.9575921Z compiled=True, 2025-05-07T20:33:19.9576010Z ) 2025-05-07T20:33:19.9576255Z self = 2025-05-07T20:33:19.9576461Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.9576469Z 2025-05-07T20:33:19.9576558Z @given( 2025-05-07T20:33:19.9576697Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9576813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9576943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9577085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9577218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9577307Z ) 2025-05-07T20:33:19.9577592Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9577699Z def test_silu_mul_quant( 2025-05-07T20:33:19.9577787Z self, 2025-05-07T20:33:19.9577881Z T: int, 2025-05-07T20:33:19.9577972Z D: int, 2025-05-07T20:33:19.9578083Z scale_ub: Optional[float], 2025-05-07T20:33:19.9578191Z contiguous: bool, 2025-05-07T20:33:19.9578290Z compiled: bool, 2025-05-07T20:33:19.9578387Z ) -> None: 2025-05-07T20:33:19.9578496Z torch.manual_seed(2025) 2025-05-07T20:33:19.9578582Z 2025-05-07T20:33:19.9578834Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9580826Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9580835Z 2025-05-07T20:33:19.9580978Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9580983Z 2025-05-07T20:33:19.9581102Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9581397Z self=, 2025-05-07T20:33:19.9581497Z T=4096, 2025-05-07T20:33:19.9581586Z D=7168, 2025-05-07T20:33:19.9581683Z scale_ub=None, 2025-05-07T20:33:19.9581791Z contiguous=True, 2025-05-07T20:33:19.9581891Z compiled=False, 2025-05-07T20:33:19.9581985Z ) 2025-05-07T20:33:19.9582232Z self = 2025-05-07T20:33:19.9582514Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9582520Z 2025-05-07T20:33:19.9582617Z @given( 2025-05-07T20:33:19.9582751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9582866Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9583003Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9583137Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9583268Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9583366Z ) 2025-05-07T20:33:19.9583646Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9583764Z def test_silu_mul_quant( 2025-05-07T20:33:19.9583854Z self, 2025-05-07T20:33:19.9583943Z T: int, 2025-05-07T20:33:19.9584040Z D: int, 2025-05-07T20:33:19.9584155Z scale_ub: Optional[float], 2025-05-07T20:33:19.9584261Z contiguous: bool, 2025-05-07T20:33:19.9584373Z compiled: bool, 2025-05-07T20:33:19.9584478Z ) -> None: 2025-05-07T20:33:19.9584601Z torch.manual_seed(2025) 2025-05-07T20:33:19.9584712Z 2025-05-07T20:33:19.9584907Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9586920Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
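A note on the test harness before more of the same failures: the @settings decorator repeated in every example disables Hypothesis' per-example deadline and turns on verbose reporting, and the session header further down in this log shows the tests run under a 'ci' profile (database=None, derandomize=True, print_blob=True). A sketch of how such a profile is typically registered, with values copied from that header; where it is actually registered in this repo is an assumption, not shown in the log:

# Sketch: a Hypothesis 'ci' profile matching the session header in this log.
# The registration site (e.g. a conftest.py) is assumed.
from hypothesis import HealthCheck, settings

settings.register_profile(
    "ci",
    database=None,
    derandomize=True,
    deadline=None,
    print_blob=True,
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")

With derandomize=True the generated examples are deterministic, which is why the same parameter combinations recur across retries.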
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9586930Z 2025-05-07T20:33:19.9587064Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9587071Z 2025-05-07T20:33:19.9587199Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9587451Z self=, 2025-05-07T20:33:19.9587542Z T=16384, 2025-05-07T20:33:19.9587639Z D=7168, 2025-05-07T20:33:19.9587735Z scale_ub=None, 2025-05-07T20:33:19.9587835Z contiguous=True, 2025-05-07T20:33:19.9587946Z compiled=False, 2025-05-07T20:33:19.9588032Z ) 2025-05-07T20:33:19.9588276Z self = 2025-05-07T20:33:19.9588484Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9588489Z 2025-05-07T20:33:19.9588579Z @given( 2025-05-07T20:33:19.9588772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9588887Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9589017Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9589160Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9589293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9589379Z ) 2025-05-07T20:33:19.9589667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9589775Z def test_silu_mul_quant( 2025-05-07T20:33:19.9589863Z self, 2025-05-07T20:33:19.9589960Z T: int, 2025-05-07T20:33:19.9590048Z D: int, 2025-05-07T20:33:19.9590161Z scale_ub: Optional[float], 2025-05-07T20:33:19.9590318Z contiguous: bool, 2025-05-07T20:33:19.9590417Z compiled: bool, 2025-05-07T20:33:19.9590511Z ) -> None: 2025-05-07T20:33:19.9590620Z torch.manual_seed(2025) 2025-05-07T20:33:19.9590707Z 2025-05-07T20:33:19.9590903Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9592938Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9592985Z 2025-05-07T20:33:19.9593128Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9593136Z 2025-05-07T20:33:19.9593256Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9593616Z self=, 2025-05-07T20:33:19.9593718Z T=16384, 2025-05-07T20:33:19.9593808Z D=7168, 2025-05-07T20:33:19.9593906Z scale_ub=1200.0, 2025-05-07T20:33:19.9594009Z contiguous=True, 2025-05-07T20:33:19.9594106Z compiled=False, 2025-05-07T20:33:19.9594199Z ) 2025-05-07T20:33:19.9594489Z self = 2025-05-07T20:33:19.9594696Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9594701Z 2025-05-07T20:33:19.9594797Z @given( 2025-05-07T20:33:19.9594931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9595045Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9595186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9595319Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9595454Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9595545Z ) 2025-05-07T20:33:19.9595826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9595939Z def test_silu_mul_quant( 2025-05-07T20:33:19.9596028Z self, 2025-05-07T20:33:19.9596116Z T: int, 2025-05-07T20:33:19.9596211Z D: int, 2025-05-07T20:33:19.9596322Z scale_ub: Optional[float], 2025-05-07T20:33:19.9596428Z contiguous: bool, 2025-05-07T20:33:19.9596534Z compiled: bool, 2025-05-07T20:33:19.9596624Z ) -> None: 2025-05-07T20:33:19.9596735Z torch.manual_seed(2025) 2025-05-07T20:33:19.9596826Z 2025-05-07T20:33:19.9597018Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9599075Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
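Every failure above is the same OOM signature: the test allocates a [T, 2 * D] bfloat16 tensor while the process already holds ~22.04 GiB of the device's 22.07 GiB. The "Tried to allocate" sizes in the messages follow directly from the tensor shapes; plain-Python arithmetic, no GPU needed:

# Sketch: derive the reported allocation sizes from the tensor shapes.
# bfloat16 is 2 bytes per element; the test allocates x of shape [T, 2 * D].
def bf16_alloc_mib(T: int, D: int) -> float:
    return T * (2 * D) * 2 / (1024 ** 2)

assert bf16_alloc_mib(4096, 7168) == 112.0   # matches "112.00 MiB" above
assert bf16_alloc_mib(16384, 7168) == 448.0  # matches "448.00 MiB" above

The individual allocations are modest; the problem is the ~21.7 GiB already allocated by PyTorch before each example starts.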
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9599088Z 2025-05-07T20:33:19.9599226Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9599231Z 2025-05-07T20:33:19.9599355Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9599607Z self=, 2025-05-07T20:33:19.9599697Z T=128, 2025-05-07T20:33:19.9599796Z D=5120, 2025-05-07T20:33:19.9599895Z scale_ub=1200.0, 2025-05-07T20:33:19.9599994Z contiguous=False, 2025-05-07T20:33:19.9600097Z compiled=False, 2025-05-07T20:33:19.9600231Z ) 2025-05-07T20:33:19.9600477Z self = 2025-05-07T20:33:19.9600680Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.9600688Z 2025-05-07T20:33:19.9600778Z @given( 2025-05-07T20:33:19.9600919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9601033Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9601240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9601426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9601558Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9601649Z ) 2025-05-07T20:33:19.9601936Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9602044Z def test_silu_mul_quant( 2025-05-07T20:33:19.9602133Z self, 2025-05-07T20:33:19.9602228Z T: int, 2025-05-07T20:33:19.9602316Z D: int, 2025-05-07T20:33:19.9602431Z scale_ub: Optional[float], 2025-05-07T20:33:19.9602541Z contiguous: bool, 2025-05-07T20:33:19.9602640Z compiled: bool, 2025-05-07T20:33:19.9602736Z ) -> None: 2025-05-07T20:33:19.9602848Z torch.manual_seed(2025) 2025-05-07T20:33:19.9602933Z 2025-05-07T20:33:19.9603131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9603220Z 2025-05-07T20:33:19.9603327Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9603482Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9603586Z x = x_sign * x_clamp 2025-05-07T20:33:19.9603679Z x0 = x[:, :D] 2025-05-07T20:33:19.9603780Z x1 = x[:, D:] 2025-05-07T20:33:19.9603864Z 2025-05-07T20:33:19.9603961Z if contiguous: 2025-05-07T20:33:19.9604075Z x0 = x0.contiguous() 2025-05-07T20:33:19.9604180Z x1 = x1.contiguous() 2025-05-07T20:33:19.9604265Z 2025-05-07T20:33:19.9604377Z if scale_ub is not None: 2025-05-07T20:33:19.9604503Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9604666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9604755Z ) 2025-05-07T20:33:19.9604843Z else: 2025-05-07T20:33:19.9604959Z scale_ub_tensor = None 2025-05-07T20:33:19.9605042Z 2025-05-07T20:33:19.9605192Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9605306Z op = silu_mul_quant 2025-05-07T20:33:19.9605407Z if compiled: 2025-05-07T20:33:19.9605523Z op = torch.compile(op) 2025-05-07T20:33:19.9605652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9605739Z 2025-05-07T20:33:19.9605843Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9605854Z 2025-05-07T20:33:19.9605968Z moe/activation_test.py:117: 2025-05-07T20:33:19.9606114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9606243Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9606358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9606984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9607106Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9607516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9607783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9608171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9608280Z kernel = self.compile( 2025-05-07T20:33:19.9608723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9608926Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9609119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9609124Z 2025-05-07T20:33:19.9609368Z self = 2025-05-07T20:33:19.9610286Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9610916Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbac5ea0>} 2025-05-07T20:33:19.9611759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9611987Z context = 2025-05-07T20:33:19.9611996Z 2025-05-07T20:33:19.9612183Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9612485Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9612618Z module_map=module_map) 2025-05-07T20:33:19.9612805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9612924Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9613021Z E ^ 2025-05-07T20:33:19.9613423Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9613428Z 2025-05-07T20:33:19.9613901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9613906Z 2025-05-07T20:33:19.9614026Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9614279Z self=, 2025-05-07T20:33:19.9614375Z T=2048, 2025-05-07T20:33:19.9614465Z D=7168, 2025-05-07T20:33:19.9614563Z scale_ub=None, 2025-05-07T20:33:19.9614669Z contiguous=False, 2025-05-07T20:33:19.9614767Z compiled=False, 2025-05-07T20:33:19.9614857Z ) 2025-05-07T20:33:19.9615101Z self = 2025-05-07T20:33:19.9615309Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.9615314Z 2025-05-07T20:33:19.9615408Z @given( 2025-05-07T20:33:19.9615542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9615656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9615798Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9615932Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9616073Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9616161Z ) 2025-05-07T20:33:19.9616441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9616556Z def test_silu_mul_quant( 2025-05-07T20:33:19.9616730Z self, 2025-05-07T20:33:19.9616820Z T: int, 2025-05-07T20:33:19.9616916Z D: int, 2025-05-07T20:33:19.9617030Z scale_ub: Optional[float], 2025-05-07T20:33:19.9617135Z contiguous: bool, 2025-05-07T20:33:19.9617246Z compiled: bool, 2025-05-07T20:33:19.9617339Z ) -> None: 2025-05-07T20:33:19.9617447Z torch.manual_seed(2025) 2025-05-07T20:33:19.9617541Z 2025-05-07T20:33:19.9617735Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9619744Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
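The error text itself names the standard mitigation. PYTORCH_CUDA_ALLOC_CONF has to be in the environment before CUDA is initialized, so in a workflow it belongs in the job's env rather than inside the test; a minimal sketch, assuming it is set from Python before the first CUDA call:

# Sketch: apply the allocator option suggested by the OOM messages above.
# The variable must be set before torch initializes CUDA; setting it before
# the torch import is the safe ordering.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.randn([2048, 2 * 7168], device="cuda", dtype=torch.bfloat16)

Note this only reduces fragmentation of reserved-but-unallocated memory; it cannot reclaim memory that is genuinely still allocated.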
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9619796Z 2025-05-07T20:33:19.9619934Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9619986Z 2025-05-07T20:33:19.9620152Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9620408Z self=, 2025-05-07T20:33:19.9620500Z T=128, 2025-05-07T20:33:19.9620596Z D=7168, 2025-05-07T20:33:19.9620694Z scale_ub=1200.0, 2025-05-07T20:33:19.9620793Z contiguous=True, 2025-05-07T20:33:19.9620897Z compiled=True, 2025-05-07T20:33:19.9620983Z ) 2025-05-07T20:33:19.9621229Z self = 2025-05-07T20:33:19.9621435Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.9621440Z 2025-05-07T20:33:19.9621530Z @given( 2025-05-07T20:33:19.9621673Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9621790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9621922Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9622064Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9622201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9622288Z ) 2025-05-07T20:33:19.9622574Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9622682Z def test_silu_mul_quant( 2025-05-07T20:33:19.9622771Z self, 2025-05-07T20:33:19.9622865Z T: int, 2025-05-07T20:33:19.9622953Z D: int, 2025-05-07T20:33:19.9623068Z scale_ub: Optional[float], 2025-05-07T20:33:19.9623178Z contiguous: bool, 2025-05-07T20:33:19.9623278Z compiled: bool, 2025-05-07T20:33:19.9623373Z ) -> None: 2025-05-07T20:33:19.9623481Z torch.manual_seed(2025) 2025-05-07T20:33:19.9623564Z 2025-05-07T20:33:19.9624108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9624244Z 2025-05-07T20:33:19.9624394Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9624547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9624653Z x = x_sign * x_clamp 2025-05-07T20:33:19.9624749Z x0 = x[:, :D] 2025-05-07T20:33:19.9624846Z x1 = x[:, D:] 2025-05-07T20:33:19.9624930Z 2025-05-07T20:33:19.9625027Z if contiguous: 2025-05-07T20:33:19.9625142Z x0 = x0.contiguous() 2025-05-07T20:33:19.9625244Z x1 = x1.contiguous() 2025-05-07T20:33:19.9625334Z 2025-05-07T20:33:19.9625437Z if scale_ub is not None: 2025-05-07T20:33:19.9625559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9625723Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9625810Z ) 2025-05-07T20:33:19.9625900Z else: 2025-05-07T20:33:19.9626014Z scale_ub_tensor = None 2025-05-07T20:33:19.9626288Z 2025-05-07T20:33:19.9626440Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9626553Z op = silu_mul_quant 2025-05-07T20:33:19.9626652Z if compiled: 2025-05-07T20:33:19.9626771Z op = torch.compile(op) 2025-05-07T20:33:19.9626899Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9626985Z 2025-05-07T20:33:19.9627097Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9627102Z 2025-05-07T20:33:19.9627213Z moe/activation_test.py:117: 2025-05-07T20:33:19.9627362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9627483Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9627603Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9628095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9628210Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9628774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9628892Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9629429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9629686Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9630079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9630187Z kernel = self.compile( 2025-05-07T20:33:19.9630625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9630836Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9630981Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9630989Z 2025-05-07T20:33:19.9631230Z self = 2025-05-07T20:33:19.9632114Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9632692Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbac77f0>} 2025-05-07T20:33:19.9633621Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9633847Z context = 2025-05-07T20:33:19.9633852Z 2025-05-07T20:33:19.9634050Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9634352Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9634483Z module_map=module_map) 2025-05-07T20:33:19.9634674Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9634791Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9634889Z E ^ 2025-05-07T20:33:19.9635292Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9635297Z 2025-05-07T20:33:19.9635763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9635768Z 2025-05-07T20:33:19.9635898Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9636149Z self=, 2025-05-07T20:33:19.9636244Z T=128, 2025-05-07T20:33:19.9636387Z D=7168, 2025-05-07T20:33:19.9636499Z scale_ub=1200.0, 2025-05-07T20:33:19.9636606Z contiguous=True, 2025-05-07T20:33:19.9636705Z compiled=False, 2025-05-07T20:33:19.9636789Z ) 2025-05-07T20:33:19.9637043Z self = 2025-05-07T20:33:19.9637244Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9637249Z 2025-05-07T20:33:19.9637344Z @given( 2025-05-07T20:33:19.9637479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9637597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9637737Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9637872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9638090Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9638183Z ) 2025-05-07T20:33:19.9638467Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9638579Z def test_silu_mul_quant( 2025-05-07T20:33:19.9638675Z self, 2025-05-07T20:33:19.9638763Z T: int, 2025-05-07T20:33:19.9638853Z D: int, 2025-05-07T20:33:19.9639023Z scale_ub: Optional[float], 2025-05-07T20:33:19.9639127Z contiguous: bool, 2025-05-07T20:33:19.9639277Z compiled: bool, 2025-05-07T20:33:19.9639371Z ) -> None: 2025-05-07T20:33:19.9639482Z torch.manual_seed(2025) 2025-05-07T20:33:19.9639576Z 2025-05-07T20:33:19.9639773Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9639861Z 2025-05-07T20:33:19.9639978Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9640123Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9642130Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
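By this point even a T=128 example fails on the 20 MiB temporary from torch.clamp(torch.abs(x), ...), with 21.77 GiB already allocated by PyTorch, so memory is evidently accumulating across Hypothesis examples rather than any single example being too large. A hedged sketch of per-example cleanup; whether this fully recovers the memory depends on what is keeping the earlier tensors alive:

# Sketch: release cached CUDA memory between property-based examples.
# Assumption: the growth comes from tensors still referenced between examples
# plus allocator caching, so collecting garbage and emptying the cache helps.
import gc
import torch

def reset_cuda_between_examples() -> None:
    torch.cuda.synchronize()   # let in-flight kernels finish
    gc.collect()               # drop unreachable tensors
    torch.cuda.empty_cache()   # return cached blocks to the driver

This could be wired in via a pytest fixture or run at the top of the test body.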
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9642139Z 2025-05-07T20:33:19.9642276Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.9642281Z 2025-05-07T20:33:19.9642399Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9642659Z self=, 2025-05-07T20:33:19.9642749Z T=128, 2025-05-07T20:33:19.9642845Z D=5120, 2025-05-07T20:33:19.9642943Z scale_ub=1200.0, 2025-05-07T20:33:19.9643045Z contiguous=True, 2025-05-07T20:33:19.9643147Z compiled=True, 2025-05-07T20:33:19.9643231Z ) 2025-05-07T20:33:19.9643477Z self = 2025-05-07T20:33:19.9643674Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.9643678Z 2025-05-07T20:33:19.9643766Z @given( 2025-05-07T20:33:19.9643900Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9644026Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9644156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9644295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9644425Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9644509Z ) 2025-05-07T20:33:19.9644792Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9644899Z def test_silu_mul_quant( 2025-05-07T20:33:19.9644991Z self, 2025-05-07T20:33:19.9645085Z T: int, 2025-05-07T20:33:19.9645174Z D: int, 2025-05-07T20:33:19.9645287Z scale_ub: Optional[float], 2025-05-07T20:33:19.9645450Z contiguous: bool, 2025-05-07T20:33:19.9645551Z compiled: bool, 2025-05-07T20:33:19.9645642Z ) -> None: 2025-05-07T20:33:19.9645756Z torch.manual_seed(2025) 2025-05-07T20:33:19.9645840Z 2025-05-07T20:33:19.9646041Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9646133Z 2025-05-07T20:33:19.9646240Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9646389Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9648379Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
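The run's other failure mode is deterministic rather than memory-dependent: Triton rejects the fp8e4nv (FP8 E4M3) dtype at kernel-compile time on this GPU, and as the retry below shows, the reference path (_kernel_quantize_fp8_row, reached through triton_quantize_fp8_row) fails identically to the kernel under test (_fbgemm_silu_mul_quant). This is expected on a pre-Hopper part: a g5 runner carries an A10G at compute capability 8.6 (an inference about this runner, not stated in the log), while Triton's fp8e4nv generally needs capability 8.9 or newer. A hedged sketch of gating the test on device capability:

# Sketch: skip FP8 tests where Triton lacks fp8e4nv support.
# Assumption: fp8e4nv requires compute capability >= 8.9; verify the cutoff
# against the Triton version actually in use.
import pytest
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@pytest.mark.skipif(
    not supports_fp8e4nv(),
    reason="Triton fp8e4nv not supported on this GPU architecture",
)
def test_silu_mul_quant() -> None:  # illustrative stand-in for the real test
    ...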
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9648431Z 2025-05-07T20:33:19.9648572Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.9648577Z 2025-05-07T20:33:19.9648740Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9649031Z self=, 2025-05-07T20:33:19.9649128Z T=128, 2025-05-07T20:33:19.9649216Z D=7168, 2025-05-07T20:33:19.9649310Z scale_ub=None, 2025-05-07T20:33:19.9649415Z contiguous=True, 2025-05-07T20:33:19.9649511Z compiled=True, 2025-05-07T20:33:19.9649595Z ) 2025-05-07T20:33:19.9649844Z self = 2025-05-07T20:33:19.9650032Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.9650041Z 2025-05-07T20:33:19.9650135Z @given( 2025-05-07T20:33:19.9650268Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9650383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9650520Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9650657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9650791Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9650881Z ) 2025-05-07T20:33:19.9651166Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9651280Z def test_silu_mul_quant( 2025-05-07T20:33:19.9651369Z self, 2025-05-07T20:33:19.9651456Z T: int, 2025-05-07T20:33:19.9651553Z D: int, 2025-05-07T20:33:19.9651667Z scale_ub: Optional[float], 2025-05-07T20:33:19.9651769Z contiguous: bool, 2025-05-07T20:33:19.9651872Z compiled: bool, 2025-05-07T20:33:19.9651966Z ) -> None: 2025-05-07T20:33:19.9652075Z torch.manual_seed(2025) 2025-05-07T20:33:19.9652165Z 2025-05-07T20:33:19.9652358Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9654370Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9654380Z 2025-05-07T20:33:19.9654514Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9654669Z =============================== warnings summary =============================== 2025-05-07T20:33:19.9655028Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:19.9655423Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:19.9655775Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:19.9656767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:19.9657036Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:19.9657041Z 2025-05-07T20:33:19.9657280Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:19.9657473Z ================= 1 failed, 1 deselected, 3 warnings in 18.35s ================= 2025-05-07T20:33:21.5929004Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:21.6581081Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:33:21.6581518Z 2025-05-07T20:33:23.6601063Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:25.8305104Z ============================= test session starts ============================== 2025-05-07T20:33:25.8306258Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:25.8307188Z cachedir: .pytest_cache 2025-05-07T20:33:25.8308148Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:25.8309469Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:25.8310216Z plugins: hypothesis-6.131.14 2025-05-07T20:33:27.4795049Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:27.6735760Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:27.6736256Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:27.6736504Z 2025-05-07T20:33:30.3771701Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.3772701Z self=, 2025-05-07T20:33:30.3773171Z T=1, 2025-05-07T20:33:30.3773385Z D=5120, 2025-05-07T20:33:30.3773688Z scale_ub=None, 2025-05-07T20:33:30.3773944Z contiguous=True, 2025-05-07T20:33:30.3774198Z compiled=True, 2025-05-07T20:33:30.3774463Z ) 2025-05-07T20:33:30.3774829Z self = 2025-05-07T20:33:30.3775381Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:30.3775681Z 2025-05-07T20:33:30.3775773Z @given( 2025-05-07T20:33:30.3776039Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.3776396Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.3776750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.3777133Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.3777510Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.3777830Z ) 2025-05-07T20:33:30.3778232Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.3778733Z def test_silu_mul_quant( 2025-05-07T20:33:30.3779015Z self, 2025-05-07T20:33:30.3779237Z T: int, 2025-05-07T20:33:30.3779466Z D: int, 2025-05-07T20:33:30.3779724Z scale_ub: Optional[float], 2025-05-07T20:33:30.3780029Z contiguous: bool, 2025-05-07T20:33:30.3780305Z compiled: bool, 2025-05-07T20:33:30.3780565Z ) -> None: 2025-05-07T20:33:30.3781189Z torch.manual_seed(2025) 2025-05-07T20:33:30.3781471Z 2025-05-07T20:33:30.3781786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.3782166Z 2025-05-07T20:33:30.3782396Z x_sign = torch.sign(x) 2025-05-07T20:33:30.3782734Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:30.3783083Z x = x_sign * x_clamp 2025-05-07T20:33:30.3783359Z x0 = x[:, :D] 2025-05-07T20:33:30.3783612Z x1 = x[:, D:] 2025-05-07T20:33:30.3783843Z 2025-05-07T20:33:30.3784056Z if contiguous: 2025-05-07T20:33:30.3784322Z x0 = x0.contiguous() 2025-05-07T20:33:30.3784613Z x1 = x1.contiguous() 2025-05-07T20:33:30.3784892Z 2025-05-07T20:33:30.3785115Z if scale_ub is not None: 2025-05-07T20:33:30.3785543Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.3785920Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.3786282Z ) 2025-05-07T20:33:30.3786510Z else: 2025-05-07T20:33:30.3786750Z scale_ub_tensor = None 2025-05-07T20:33:30.3787038Z 2025-05-07T20:33:30.3787305Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.3787753Z op = silu_mul_quant 2025-05-07T20:33:30.3788129Z if compiled: 2025-05-07T20:33:30.3788417Z op = torch.compile(op) 2025-05-07T20:33:30.3788746Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.3789064Z 2025-05-07T20:33:30.3789286Z y_fp8, y_scale = fn() 2025-05-07T20:33:30.3789606Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:30.3789941Z 2025-05-07T20:33:30.3790213Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.3790597Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:30.3790923Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:30.3791285Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:30.3791688Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:30.3792039Z 2025-05-07T20:33:30.3792270Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:30.3792493Z 2025-05-07T20:33:30.3792616Z moe/activation_test.py:126: 2025-05-07T20:33:30.3792952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.3793331Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:30.3793865Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:30.3794759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:30.3795597Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:30.3796217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:30.3796992Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:30.3797761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:30.3798578Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:30.3799434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:30.3800278Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:30.3801090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:30.3801809Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:30.3802486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:30.3803070Z fn() 2025-05-07T20:33:30.3803705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:30.3804364Z self.fn.run( 
2025-05-07T20:33:30.3804894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:30.3805490Z kernel = self.compile( 2025-05-07T20:33:30.3806100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:30.3806842Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:30.3807291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.3807549Z 2025-05-07T20:33:30.3807783Z self = 2025-05-07T20:33:30.3809061Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:30.3810632Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae2b098af0>} 2025-05-07T20:33:30.3812268Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:30.3813418Z context = 2025-05-07T20:33:30.3813742Z 2025-05-07T20:33:30.3813933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:30.3821767Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:30.3822473Z module_map=module_map) 2025-05-07T20:33:30.3822903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:30.3823308Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:30.3823617Z E ^ 2025-05-07T20:33:30.3824415Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:30.3824926Z 2025-05-07T20:33:30.3825406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:30.3825988Z 2025-05-07T20:33:30.3826108Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.3826576Z self=, 2025-05-07T20:33:30.3827056Z T=2048, 2025-05-07T20:33:30.3827283Z D=5120, 2025-05-07T20:33:30.3827516Z scale_ub=1200.0, 2025-05-07T20:33:30.3827777Z contiguous=True, 2025-05-07T20:33:30.3828023Z compiled=False, 2025-05-07T20:33:30.3828262Z ) 2025-05-07T20:33:31.7998144Z self = 2025-05-07T20:33:31.7998838Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.7999130Z 2025-05-07T20:33:31.7999215Z @given( 2025-05-07T20:33:31.7999517Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.7999857Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.8000180Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.8000528Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.8000870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.8001172Z ) 2025-05-07T20:33:31.8001544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.8002003Z def test_silu_mul_quant( 2025-05-07T20:33:31.8002267Z self, 2025-05-07T20:33:31.8002477Z T: int, 2025-05-07T20:33:31.8002684Z D: int, 2025-05-07T20:33:31.8002916Z scale_ub: Optional[float], 2025-05-07T20:33:31.8003488Z contiguous: bool, 2025-05-07T20:33:31.8003745Z compiled: bool, 2025-05-07T20:33:31.8003988Z ) -> None: 2025-05-07T20:33:31.8004221Z torch.manual_seed(2025) 2025-05-07T20:33:31.8004474Z 2025-05-07T20:33:31.8004769Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.8005135Z 
2025-05-07T20:33:31.8005343Z x_sign = torch.sign(x) 2025-05-07T20:33:31.8005648Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.8005979Z x = x_sign * x_clamp 2025-05-07T20:33:31.8006237Z x0 = x[:, :D] 2025-05-07T20:33:31.8006463Z x1 = x[:, D:] 2025-05-07T20:33:31.8006684Z 2025-05-07T20:33:31.8006883Z if contiguous: 2025-05-07T20:33:31.8007124Z x0 = x0.contiguous() 2025-05-07T20:33:31.8007542Z x1 = x1.contiguous() 2025-05-07T20:33:31.8007800Z 2025-05-07T20:33:31.8008000Z if scale_ub is not None: 2025-05-07T20:33:31.8008297Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.8008654Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.8008975Z ) 2025-05-07T20:33:31.8009183Z else: 2025-05-07T20:33:31.8009408Z scale_ub_tensor = None 2025-05-07T20:33:31.8009748Z 2025-05-07T20:33:31.8010103Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.8010440Z op = silu_mul_quant 2025-05-07T20:33:31.8010709Z if compiled: 2025-05-07T20:33:31.8010970Z op = torch.compile(op) 2025-05-07T20:33:31.8011289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.8011587Z 2025-05-07T20:33:31.8011787Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.8011968Z 2025-05-07T20:33:31.8012074Z moe/activation_test.py:117: 2025-05-07T20:33:31.8012394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.8012740Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.8013042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.8013774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.8014506Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.8015068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.8015784Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.8016482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.8017039Z kernel = self.compile( 2025-05-07T20:33:31.8017610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.8018308Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.8018726Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.8018968Z 2025-05-07T20:33:31.8019187Z self = 2025-05-07T20:33:31.8020323Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.8021789Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fae2af71990>} 2025-05-07T20:33:31.8023198Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.8024654Z context = 2025-05-07T20:33:31.8024958Z 2025-05-07T20:33:31.8025211Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.8025759Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.8026258Z module_map=module_map) 2025-05-07T20:33:31.8026640Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.8027031Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.8027339Z E ^ 2025-05-07T20:33:31.8027827Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.8028296Z 2025-05-07T20:33:31.8028729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.8029338Z 2025-05-07T20:33:31.8029447Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.8029886Z self=, 2025-05-07T20:33:31.8030305Z T=2048, 2025-05-07T20:33:31.8030507Z D=5120, 2025-05-07T20:33:31.8030713Z scale_ub=1200.0, 2025-05-07T20:33:31.8030950Z contiguous=True, 2025-05-07T20:33:31.8031177Z compiled=True, 2025-05-07T20:33:31.8031467Z ) 2025-05-07T20:33:31.8031911Z self = 2025-05-07T20:33:31.8032423Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:31.8032715Z 2025-05-07T20:33:31.8032797Z @given( 2025-05-07T20:33:31.8033043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.8033365Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.8033752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.8034105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.8034446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.8034748Z ) 2025-05-07T20:33:31.8035123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.8035588Z def test_silu_mul_quant( 2025-05-07T20:33:31.8035838Z self, 2025-05-07T20:33:31.8036046Z T: int, 2025-05-07T20:33:31.8036263Z D: int, 2025-05-07T20:33:31.8036490Z scale_ub: Optional[float], 2025-05-07T20:33:31.8036780Z contiguous: bool, 2025-05-07T20:33:31.8037034Z compiled: bool, 2025-05-07T20:33:31.8037268Z ) -> None: 2025-05-07T20:33:31.8037499Z torch.manual_seed(2025) 2025-05-07T20:33:31.8037754Z 2025-05-07T20:33:31.8038036Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.8038392Z 2025-05-07T20:33:31.8038597Z x_sign = torch.sign(x) 2025-05-07T20:33:31.8038894Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.8039222Z x = x_sign * x_clamp 2025-05-07T20:33:31.8039476Z x0 = x[:, :D] 2025-05-07T20:33:31.8039700Z x1 = x[:, D:] 2025-05-07T20:33:31.8039923Z 2025-05-07T20:33:31.8040123Z if contiguous: 2025-05-07T20:33:31.8040362Z x0 = x0.contiguous() 2025-05-07T20:33:31.8040635Z x1 = x1.contiguous() 2025-05-07T20:33:31.8040888Z 2025-05-07T20:33:31.8041096Z if scale_ub is not None: 2025-05-07T20:33:31.8041383Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.8041736Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.8042059Z ) 2025-05-07T20:33:31.8042257Z else: 2025-05-07T20:33:31.8042480Z scale_ub_tensor = None 2025-05-07T20:33:31.8042747Z 2025-05-07T20:33:31.8042985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.8043314Z op = silu_mul_quant 2025-05-07T20:33:31.8043579Z if compiled: 
2025-05-07T20:33:31.8043840Z op = torch.compile(op) 2025-05-07T20:33:31.8044152Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.8044441Z 2025-05-07T20:33:31.8044692Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.8044997Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.8045305Z 2025-05-07T20:33:31.8045559Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.8045909Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.8046229Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.8046558Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.8046929Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.8047259Z 2025-05-07T20:33:31.8047476Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:31.8047680Z 2025-05-07T20:33:31.8047786Z moe/activation_test.py:126: 2025-05-07T20:33:31.8048105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.8048502Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.8048851Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.8049667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.8050449Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.8051101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.8051807Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.8052524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.8053280Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.8054064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:31.8054840Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.8055598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.8056261Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.8056892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.8057481Z fn() 2025-05-07T20:33:31.8058007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.8058613Z self.fn.run( 2025-05-07T20:33:31.8059096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.8059652Z kernel = self.compile( 2025-05-07T20:33:31.8060222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.8060907Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.8061316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.8061564Z 2025-05-07T20:33:31.8061781Z self = 2025-05-07T20:33:31.8062905Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:33:31.8064328Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae29a196c0>} 2025-05-07T20:33:31.8065716Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.8066830Z context = 2025-05-07T20:33:31.8067138Z 2025-05-07T20:33:31.8067312Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.8067870Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.8068427Z module_map=module_map) 2025-05-07T20:33:31.8068813Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.8069191Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.8069492Z E ^ 2025-05-07T20:33:31.8069982Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.8070456Z 2025-05-07T20:33:31.8070937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.8071477Z 2025-05-07T20:33:31.8071594Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.8072032Z self=, 2025-05-07T20:33:31.8072457Z T=16384, 2025-05-07T20:33:31.8072662Z D=7168, 2025-05-07T20:33:31.8072913Z scale_ub=1200.0, 2025-05-07T20:33:31.8073155Z contiguous=False, 2025-05-07T20:33:31.8073429Z compiled=False, 2025-05-07T20:33:31.8073721Z ) 2025-05-07T20:33:33.0106554Z self = 2025-05-07T20:33:33.0107111Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:33.0107414Z 2025-05-07T20:33:33.0108088Z @given( 2025-05-07T20:33:33.0108481Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0108933Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0109419Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0109777Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0110132Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0110442Z ) 2025-05-07T20:33:33.0110819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0111304Z def test_silu_mul_quant( 2025-05-07T20:33:33.0111576Z self, 2025-05-07T20:33:33.0111799Z T: int, 2025-05-07T20:33:33.0112014Z D: int, 2025-05-07T20:33:33.0112244Z scale_ub: Optional[float], 2025-05-07T20:33:33.0112544Z contiguous: bool, 2025-05-07T20:33:33.0112805Z compiled: bool, 2025-05-07T20:33:33.0113044Z ) -> None: 2025-05-07T20:33:33.0113283Z torch.manual_seed(2025) 2025-05-07T20:33:33.0113664Z 2025-05-07T20:33:33.0113955Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0114341Z 2025-05-07T20:33:33.0114552Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0114859Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0115192Z x = x_sign * x_clamp 2025-05-07T20:33:33.0115456Z x0 = x[:, :D] 2025-05-07T20:33:33.0115691Z x1 = x[:, D:] 2025-05-07T20:33:33.0115920Z 2025-05-07T20:33:33.0116121Z if contiguous: 2025-05-07T20:33:33.0116362Z x0 = x0.contiguous() 2025-05-07T20:33:33.0116643Z x1 = x1.contiguous() 2025-05-07T20:33:33.0116915Z 2025-05-07T20:33:33.0117131Z if scale_ub is not None: 2025-05-07T20:33:33.0117424Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0117841Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0118177Z ) 2025-05-07T20:33:33.0118387Z else: 2025-05-07T20:33:33.0118618Z scale_ub_tensor = None 2025-05-07T20:33:33.0118890Z 2025-05-07T20:33:33.0119140Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]:
2025-05-07T20:33:33.0119478Z             op = silu_mul_quant
2025-05-07T20:33:33.0119778Z             if compiled:
2025-05-07T20:33:33.0120043Z                 op = torch.compile(op)
2025-05-07T20:33:33.0120697Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:33.0120988Z 
2025-05-07T20:33:33.0121240Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:33.0121484Z 
2025-05-07T20:33:33.0121628Z moe/activation_test.py:117: 
2025-05-07T20:33:33.0121947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:33.0122300Z moe/activation_test.py:115: in fn
2025-05-07T20:33:33.0122612Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:33.0123352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:33.0124619Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:33.0125206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:33.0126096Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:33.0126802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:33.0127407Z     kernel = self.compile(
2025-05-07T20:33:33.0128013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:33.0128898Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:33.0129320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:33.0129570Z 
2025-05-07T20:33:33.0129791Z self = 
2025-05-07T20:33:33.0130936Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:33.0132611Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae29a18940>}
2025-05-07T20:33:33.0134046Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:33.0135138Z context = 
2025-05-07T20:33:33.0135448Z 
2025-05-07T20:33:33.0135626Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:33.0136184Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:33.0136677Z                            module_map=module_map)
2025-05-07T20:33:33.0137069Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:33.0137476Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:33.0137782Z E       ^
2025-05-07T20:33:33.0138273Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.0138752Z 
2025-05-07T20:33:33.0139190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:33.0139733Z 
2025-05-07T20:33:33.0139853Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:33.0140289Z     self=,
2025-05-07T20:33:33.0140713Z     T=1,
2025-05-07T20:33:33.0140916Z     D=7168,
2025-05-07T20:33:33.0141130Z     scale_ub=None,
2025-05-07T20:33:33.0141357Z     contiguous=True,
2025-05-07T20:33:33.0141600Z     compiled=True,
2025-05-07T20:33:33.0141830Z )
2025-05-07T20:33:33.0142167Z self = 
2025-05-07T20:33:33.0142684Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:33:33.0142956Z 
2025-05-07T20:33:33.0143048Z     @given(
2025-05-07T20:33:33.0143376Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:33.0143715Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:33.0144043Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:33.0144395Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:33.0144751Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:33.0145057Z     )
2025-05-07T20:33:33.0145439Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:33.0145904Z     def test_silu_mul_quant(
2025-05-07T20:33:33.0146164Z         self,
2025-05-07T20:33:33.0146378Z         T: int,
2025-05-07T20:33:33.0146585Z         D: int,
2025-05-07T20:33:33.0146824Z         scale_ub: Optional[float],
2025-05-07T20:33:33.0147115Z         contiguous: bool,
2025-05-07T20:33:33.0147418Z         compiled: bool,
2025-05-07T20:33:33.0147665Z     ) -> None:
2025-05-07T20:33:33.0147902Z         torch.manual_seed(2025)
2025-05-07T20:33:33.0148158Z 
2025-05-07T20:33:33.0148455Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:33.0148822Z 
2025-05-07T20:33:33.0149026Z         x_sign = torch.sign(x)
2025-05-07T20:33:33.0149338Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:33.0149757Z         x = x_sign * x_clamp
2025-05-07T20:33:33.0150012Z         x0 = x[:, :D]
2025-05-07T20:33:33.0150247Z         x1 = x[:, D:]
2025-05-07T20:33:33.0150475Z 
2025-05-07T20:33:33.0150678Z         if contiguous:
2025-05-07T20:33:33.0150925Z             x0 = x0.contiguous()
2025-05-07T20:33:33.0151203Z             x1 = x1.contiguous()
2025-05-07T20:33:33.0151461Z 
2025-05-07T20:33:33.0151662Z         if scale_ub is not None:
2025-05-07T20:33:33.0151960Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:33.0152323Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:33.0152652Z             )
2025-05-07T20:33:33.0152868Z         else:
2025-05-07T20:33:33.0153098Z             scale_ub_tensor = None
2025-05-07T20:33:33.0153366Z 
2025-05-07T20:33:33.0153700Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:33.0154038Z             op = silu_mul_quant
2025-05-07T20:33:33.0154306Z             if compiled:
2025-05-07T20:33:33.0154578Z                 op = torch.compile(op)
2025-05-07T20:33:33.0154898Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:33.0155193Z 
2025-05-07T20:33:33.0155405Z         y_fp8, y_scale = fn()
2025-05-07T20:33:33.0155712Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:33.0156023Z 
2025-05-07T20:33:33.0156279Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:33.0156636Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:33.0156952Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:33.0157286Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:33.0157670Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:33.0158006Z 
2025-05-07T20:33:33.0158220Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:33.0158432Z 
2025-05-07T20:33:33.0158539Z moe/activation_test.py:126:
2025-05-07T20:33:33.0158859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:33.0159224Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:33.0159567Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:33.0160395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:33.0161188Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:33.0161759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:33.0162482Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:33.0163322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:33.0164087Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:33.0164881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:33:33.0165670Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:33.0166440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:33.0167113Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:33.0167791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:33.0168394Z     fn()
2025-05-07T20:33:33.0168934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:33.0169541Z     self.fn.run(
2025-05-07T20:33:33.0170037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:33.0170647Z     kernel = self.compile(
2025-05-07T20:33:33.0171263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:33.0171949Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:33.0172373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:33.0172616Z 
2025-05-07T20:33:33.0172840Z self = 
2025-05-07T20:33:33.0173972Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:33.0175413Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae297b0790>}
2025-05-07T20:33:33.0176825Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:33.0177951Z context = 
2025-05-07T20:33:33.0178254Z 
2025-05-07T20:33:33.0178439Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:33.0178988Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:33.0179491Z                            module_map=module_map)
2025-05-07T20:33:33.0179882Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:33.0180267Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:33.0180548Z E       ^
2025-05-07T20:33:33.0181046Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.0181521Z 
2025-05-07T20:33:33.0181976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
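Note: the ValueError above is Triton refusing to lower the fp8e4nv (float8_e4m3fn) encoding on this GPU. The job runs on a linux.g5.4xlarge.nvidia.gpu runner, whose A10G reports compute capability 8.6, while Triton only supports fp8e4nv on SM 8.9 (Ada) and SM 9.0 (Hopper) and newer, which is why only 'fp8e4b15' and 'fp8e5' are offered. A minimal guard along these lines would skip the FP8 examples on unsupported hardware (a sketch, not the test suite's actual gating; supports_fp8e4nv is a hypothetical helper):

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) needs an Ada (sm_89) or Hopper (sm_90) class GPU; the
    # A10G on this runner reports (8, 6), so the Triton kernels fail to compile.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) requires SM 8.9+")
class ActivationFP8Tests(unittest.TestCase):
    ...

Applied at class or test level, this turns the repeated CompilationError below into an ordinary skip instead of a hard failure.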
2025-05-07T20:33:33.0182647Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() (moe/activation_test.py:117): CompilationError in _fbgemm_silu_mul_quant, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:34.6120761Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:34.6155296Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fn() succeeds; same CompilationError in _kernel_quantize_fp8_row via ref_fn() (moe/activation_test.py:126)
2025-05-07T20:33:34.6872672Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:35.0728114Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:35.0770099Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError in _kernel_quantize_fp8_row via ref_fn()
2025-05-07T20:33:35.7075884Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError in _kernel_quantize_fp8_row via ref_fn()
2025-05-07T20:33:36.2946044Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError in _kernel_quantize_fp8_row via ref_fn()
2025-05-07T20:33:37.2666304Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError in _kernel_quantize_fp8_row via ref_fn()
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.0413823Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28c252d0>} 2025-05-07T20:33:38.0415255Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.0416342Z context = 2025-05-07T20:33:38.0416650Z 2025-05-07T20:33:38.0416833Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.0417391Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.0417895Z module_map=module_map) 2025-05-07T20:33:38.0418333Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.0418715Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:38.0418998Z E ^ 2025-05-07T20:33:38.0419499Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.0419978Z 2025-05-07T20:33:38.0420429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.0421017Z 2025-05-07T20:33:38.0421170Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.0421617Z self=, 2025-05-07T20:33:38.0422045Z T=16384, 2025-05-07T20:33:38.0422259Z D=5120, 2025-05-07T20:33:38.0422463Z scale_ub=None, 2025-05-07T20:33:38.0422700Z contiguous=True, 2025-05-07T20:33:38.0422942Z compiled=True, 2025-05-07T20:33:38.0423157Z ) 2025-05-07T20:33:38.0830116Z W0507 20:33:38.081000 88487 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:33:38.0831720Z W0507 20:33:38.081000 88487 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:33:38.0833173Z W0507 20:33:38.081000 88487 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:33:38.0834329Z W0507 20:33:38.081000 88487 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:33:38.0835530Z W0507 20:33:38.081000 88487 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
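The two warnings above are a secondary symptom: torch._dynamo stopped compiling silu_mul_quant after eight recompiles because every Hypothesis example perturbs the inputs (the last guard failure it reports is an x0 stride mismatch from a contiguous=False case), and it fell back to eager for the remaining examples. Below is a minimal sketch, assuming the harness is free to configure Dynamo, of two ways to stay under the limit; neither line appears in the FBGEMM test itself.

import torch

# Assumption: the harness may simply allow more recompiles per function
# (the default limit hit above is 8).
torch._dynamo.config.recompile_limit = 64

# Assumption: marking dim 0 (the T axis) dynamic lets one compiled graph
# serve every T that Hypothesis draws. It does not absorb the
# contiguous/strided layout flip, which still fails a stride guard.
def compiled_call(op, x0, x1, scale_ub_tensor):
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)
    return torch.compile(op)(x0, x1, scale_ub_tensor)

Rerunning with TORCH_LOGS="recompiles", as the warning suggests, would print every recompilation reason rather than only the last one.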
2025-05-07T20:33:38.1923329Z self = 2025-05-07T20:33:38.1925110Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:38.1925852Z 2025-05-07T20:33:38.1926072Z @given( 2025-05-07T20:33:38.1926676Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.1927404Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.1927997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.1928641Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.1929270Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.1929669Z ) 2025-05-07T20:33:38.1930089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.1930577Z def test_silu_mul_quant( 2025-05-07T20:33:38.1930847Z self, 2025-05-07T20:33:38.1931067Z T: int, 2025-05-07T20:33:38.1931281Z D: int, 2025-05-07T20:33:38.1931528Z scale_ub: Optional[float], 2025-05-07T20:33:38.1931835Z contiguous: bool, 2025-05-07T20:33:38.1932097Z compiled: bool, 2025-05-07T20:33:38.1932345Z ) -> None: 2025-05-07T20:33:38.1932583Z torch.manual_seed(2025) 2025-05-07T20:33:38.1932973Z 2025-05-07T20:33:38.1933282Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.1933661Z 2025-05-07T20:33:38.1933875Z x_sign = torch.sign(x) 2025-05-07T20:33:38.1934197Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.1934542Z x = x_sign * x_clamp 2025-05-07T20:33:38.1934812Z x0 = x[:, :D] 2025-05-07T20:33:38.1935048Z x1 = x[:, D:] 2025-05-07T20:33:38.1935278Z 2025-05-07T20:33:38.1935487Z if contiguous: 2025-05-07T20:33:38.1935742Z x0 = x0.contiguous() 2025-05-07T20:33:38.1936032Z x1 = x1.contiguous() 2025-05-07T20:33:38.1936305Z 2025-05-07T20:33:38.1936516Z if scale_ub is not None: 2025-05-07T20:33:38.1936825Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.1937274Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.1937611Z ) 2025-05-07T20:33:38.1937828Z else: 2025-05-07T20:33:38.1938067Z scale_ub_tensor = None 2025-05-07T20:33:38.1938343Z 2025-05-07T20:33:38.1938606Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.1938954Z op = silu_mul_quant 2025-05-07T20:33:38.1939308Z if compiled: 2025-05-07T20:33:38.1939641Z op = torch.compile(op) 2025-05-07T20:33:38.1939977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.1940284Z 2025-05-07T20:33:38.1940492Z y_fp8, y_scale = fn() 2025-05-07T20:33:38.1940818Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:38.1941140Z 2025-05-07T20:33:38.1941401Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.1941773Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:38.1942105Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:38.1942449Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:38.1942847Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:38.1943196Z 2025-05-07T20:33:38.1943419Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:38.1943639Z 2025-05-07T20:33:38.1943754Z moe/activation_test.py:126: 2025-05-07T20:33:38.1944090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.1944464Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:38.1944825Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:38.1945701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:38.1946534Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:38.1947149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.1947905Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.1948671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:38.1949529Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:38.1950365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:38.1951198Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:38.1952017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:38.1952733Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:38.1953398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:38.1954085Z fn() 2025-05-07T20:33:38.1954705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:38.1955352Z self.fn.run( 2025-05-07T20:33:38.1955865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.1956463Z kernel = self.compile( 2025-05-07T20:33:38.1957067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.1957789Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.1958230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.1958488Z 2025-05-07T20:33:38.1958723Z self = 2025-05-07T20:33:38.1959933Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.1961527Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28c360e0>} 2025-05-07T20:33:38.1963102Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.1964240Z context = 2025-05-07T20:33:38.1964559Z 2025-05-07T20:33:38.1964750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.1965327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.1965847Z module_map=module_map) 2025-05-07T20:33:38.1966255Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.1966653Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:38.1966948Z E ^ 2025-05-07T20:33:38.1967463Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.1967968Z 2025-05-07T20:33:38.1968436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.1969001Z 2025-05-07T20:33:38.1969126Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.1969579Z self=, 2025-05-07T20:33:38.1970025Z T=1, 2025-05-07T20:33:38.1970231Z D=5120, 2025-05-07T20:33:38.1970445Z scale_ub=1200.0, 2025-05-07T20:33:38.1970697Z contiguous=True, 2025-05-07T20:33:38.1970949Z compiled=True, 2025-05-07T20:33:38.1971173Z ) 2025-05-07T20:33:38.3493384Z self = 2025-05-07T20:33:38.3494933Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:38.3495704Z 2025-05-07T20:33:38.3495945Z @given( 2025-05-07T20:33:38.3496540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.3497221Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.3497874Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.3498583Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.3499098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.3499400Z ) 2025-05-07T20:33:38.3499780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.3500257Z def test_silu_mul_quant( 2025-05-07T20:33:38.3500519Z self, 2025-05-07T20:33:38.3500723Z T: int, 2025-05-07T20:33:38.3500944Z D: int, 2025-05-07T20:33:38.3501183Z scale_ub: Optional[float], 2025-05-07T20:33:38.3501469Z contiguous: bool, 2025-05-07T20:33:38.3501739Z compiled: bool, 2025-05-07T20:33:38.3502111Z ) -> None: 2025-05-07T20:33:38.3502346Z torch.manual_seed(2025) 2025-05-07T20:33:38.3502608Z 2025-05-07T20:33:38.3502902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.3503267Z 2025-05-07T20:33:38.3503481Z x_sign = torch.sign(x) 2025-05-07T20:33:38.3503796Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.3504133Z x = x_sign * x_clamp 2025-05-07T20:33:38.3504400Z x0 = x[:, :D] 2025-05-07T20:33:38.3504644Z x1 = x[:, D:] 2025-05-07T20:33:38.3504865Z 2025-05-07T20:33:38.3505072Z if contiguous: 2025-05-07T20:33:38.3505323Z x0 = x0.contiguous() 2025-05-07T20:33:38.3505604Z x1 = x1.contiguous() 2025-05-07T20:33:38.3505934Z 2025-05-07T20:33:38.3506144Z if scale_ub is not None: 2025-05-07T20:33:38.3506444Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.3506803Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.3507137Z ) 2025-05-07T20:33:38.3507348Z else: 2025-05-07T20:33:38.3507571Z scale_ub_tensor = None 2025-05-07T20:33:38.3507842Z 2025-05-07T20:33:38.3508165Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.3508556Z op = silu_mul_quant 2025-05-07T20:33:38.3508830Z if compiled: 2025-05-07T20:33:38.3509101Z op = torch.compile(op) 2025-05-07T20:33:38.3509415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.3509761Z 2025-05-07T20:33:38.3509973Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.3510151Z 2025-05-07T20:33:38.3510258Z moe/activation_test.py:117: 2025-05-07T20:33:38.3510578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.3510937Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.3511242Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.3511840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.3512441Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.3513147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.3513967Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.3514540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:38.3515268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.3515974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.3516535Z kernel = self.compile( 2025-05-07T20:33:38.3517119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.3517826Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.3518255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.3518514Z 2025-05-07T20:33:38.3518792Z self = 2025-05-07T20:33:38.3520241Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.3521786Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28222560>} 2025-05-07T20:33:38.3523218Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.3524648Z context = 2025-05-07T20:33:38.3524966Z 2025-05-07T20:33:38.3525144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.3525707Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.3526210Z module_map=module_map) 2025-05-07T20:33:38.3526597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.3526976Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.3527255Z E ^ 2025-05-07T20:33:38.3527745Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.3528230Z 2025-05-07T20:33:38.3528739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.3529291Z
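This CompilationError is the primary failure and repeats identically for every example: on this runner's GPU, Triton's CUDA backend can only lower the 'fp8e4b15' and 'fp8e5' formats, so a kernel that materializes fp8e4nv (PyTorch's float8_e4m3fn) fails in ast_to_ttir before any values are computed. A sketch of a capability gate that would skip such cases up front follows; the (8, 9) compute-capability cutoff is an assumption about where Triton enables fp8e4nv (Ada/Hopper-class parts such as L4 or H100, whereas the A10G in a g5.4xlarge reports (8, 6)), and the decorator name is hypothetical.

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Assumed cutoff: SM 8.9 and newer expose fp8e4nv in Triton.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical guard for test_silu_mul_quant and the other fp8 rowwise tests.
skip_unless_fp8e4nv = unittest.skipUnless(
    supports_fp8e4nv(), "Triton fp8e4nv unsupported on this architecture"
)

Applied as a decorator, the gate turns each would-be CompilationError into a skip, which keeps genuine numeric regressions from being buried under architecture noise.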
2025-05-07T20:33:38.3529408Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.3529852Z self=, 2025-05-07T20:33:38.3530278Z T=1, 2025-05-07T20:33:38.3530477Z D=5120, 2025-05-07T20:33:38.3530758Z scale_ub=None, 2025-05-07T20:33:38.3530989Z contiguous=False, 2025-05-07T20:33:38.3531289Z compiled=True, 2025-05-07T20:33:38.3531512Z ) 2025-05-07T20:33:38.4211268Z self = 2025-05-07T20:33:38.4212045Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:38.4212429Z 2025-05-07T20:33:38.4212545Z @given( 2025-05-07T20:33:38.4212860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.4213194Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.4213522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.4213875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.4214230Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.4214534Z ) 2025-05-07T20:33:38.4214908Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.4215373Z def test_silu_mul_quant( 2025-05-07T20:33:38.4215629Z self, 2025-05-07T20:33:38.4215837Z T: int, 2025-05-07T20:33:38.4216054Z D: int, 2025-05-07T20:33:38.4216292Z scale_ub: Optional[float], 2025-05-07T20:33:38.4216575Z contiguous: bool, 2025-05-07T20:33:38.4216829Z compiled: bool, 2025-05-07T20:33:38.4217067Z ) -> None: 2025-05-07T20:33:38.4217295Z torch.manual_seed(2025) 2025-05-07T20:33:38.4217552Z 2025-05-07T20:33:38.4217841Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.4218198Z 2025-05-07T20:33:38.4218406Z x_sign = torch.sign(x) 2025-05-07T20:33:38.4218714Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.4219039Z x = x_sign * x_clamp 2025-05-07T20:33:38.4219298Z x0 = x[:, :D] 2025-05-07T20:33:38.4219533Z x1 = x[:, D:] 2025-05-07T20:33:38.4219750Z 2025-05-07T20:33:38.4219950Z if contiguous: 2025-05-07T20:33:38.4220199Z x0 = x0.contiguous() 2025-05-07T20:33:38.4220473Z x1 = x1.contiguous() 2025-05-07T20:33:38.4220733Z 2025-05-07T20:33:38.4220943Z if scale_ub is not None: 2025-05-07T20:33:38.4221232Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.4221593Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.4221922Z ) 2025-05-07T20:33:38.4222128Z else: 2025-05-07T20:33:38.4222346Z scale_ub_tensor = None 2025-05-07T20:33:38.4222613Z 2025-05-07T20:33:38.4222861Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.4223193Z op = silu_mul_quant 2025-05-07T20:33:38.4223460Z if compiled: 2025-05-07T20:33:38.4223723Z op = torch.compile(op) 2025-05-07T20:33:38.4224351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.4224647Z 2025-05-07T20:33:38.4224853Z y_fp8, y_scale = fn() 2025-05-07T20:33:38.4225152Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:38.4225463Z 2025-05-07T20:33:38.4225719Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.4226069Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:38.4226385Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:38.4226718Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:38.4227098Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:38.4227430Z 2025-05-07T20:33:38.4227660Z > y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:33:38.4227934Z 2025-05-07T20:33:38.4228045Z moe/activation_test.py:126: 2025-05-07T20:33:38.4228357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.4228713Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:38.4229060Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:38.4229888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:38.4230812Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:38.4231389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.4232109Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.4232828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:38.4233662Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:38.4234464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:38.4235250Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:38.4236008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:38.4236686Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:38.4237326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:38.4237869Z fn() 2025-05-07T20:33:38.4238404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:38.4239016Z self.fn.run( 2025-05-07T20:33:38.4239508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.4240066Z kernel = self.compile( 2025-05-07T20:33:38.4240644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.4241332Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.4241756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.4242001Z 2025-05-07T20:33:38.4242224Z self = 2025-05-07T20:33:38.4243358Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.4244801Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fae28d47910>} 2025-05-07T20:33:38.4251716Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.4252810Z context = 2025-05-07T20:33:38.4253121Z 2025-05-07T20:33:38.4253300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.4253847Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.4254343Z module_map=module_map) 2025-05-07T20:33:38.4254728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.4255102Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:38.4255384Z E ^ 2025-05-07T20:33:38.4255880Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.4256402Z 2025-05-07T20:33:38.4256844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.4257386Z 2025-05-07T20:33:38.4257498Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.4257933Z self=, 2025-05-07T20:33:38.4258396Z T=1, 2025-05-07T20:33:38.4258587Z D=5120, 2025-05-07T20:33:38.4258836Z scale_ub=None, 2025-05-07T20:33:38.4259067Z contiguous=True, 2025-05-07T20:33:38.4259300Z compiled=False, 2025-05-07T20:33:38.4259515Z ) 2025-05-07T20:33:38.7503845Z self = 2025-05-07T20:33:38.7504634Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:38.7505018Z 2025-05-07T20:33:38.7505135Z @given( 2025-05-07T20:33:38.7505479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.7505949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.7506394Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.7506817Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.7507164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.7507468Z ) 2025-05-07T20:33:38.7507833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.7508303Z def test_silu_mul_quant( 2025-05-07T20:33:38.7508560Z self, 2025-05-07T20:33:38.7508761Z T: int, 2025-05-07T20:33:38.7508969Z D: int, 2025-05-07T20:33:38.7509197Z scale_ub: Optional[float], 2025-05-07T20:33:38.7509479Z contiguous: bool, 2025-05-07T20:33:38.7509730Z compiled: bool, 2025-05-07T20:33:38.7509970Z ) -> None: 2025-05-07T20:33:38.7510197Z torch.manual_seed(2025) 2025-05-07T20:33:38.7510453Z 2025-05-07T20:33:38.7510745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.7511107Z 2025-05-07T20:33:38.7511311Z x_sign = torch.sign(x) 2025-05-07T20:33:38.7511622Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.7511950Z x = x_sign * x_clamp 2025-05-07T20:33:38.7512209Z x0 = x[:, :D] 2025-05-07T20:33:38.7512438Z x1 = x[:, D:] 2025-05-07T20:33:38.7512661Z 2025-05-07T20:33:38.7512855Z if contiguous: 2025-05-07T20:33:38.7513105Z x0 = x0.contiguous() 2025-05-07T20:33:38.7513383Z x1 = x1.contiguous() 2025-05-07T20:33:38.7513697Z 2025-05-07T20:33:38.7513903Z if scale_ub is not None: 2025-05-07T20:33:38.7514195Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.7514544Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.7514869Z ) 2025-05-07T20:33:38.7515074Z else: 2025-05-07T20:33:38.7515295Z scale_ub_tensor = None 2025-05-07T20:33:38.7515564Z 2025-05-07T20:33:38.7515808Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.7516135Z op = silu_mul_quant 2025-05-07T20:33:38.7516520Z if compiled: 2025-05-07T20:33:38.7516789Z 
op = torch.compile(op) 2025-05-07T20:33:38.7517100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7517387Z 2025-05-07T20:33:38.7517596Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.7517768Z 2025-05-07T20:33:38.7517882Z moe/activation_test.py:117: 2025-05-07T20:33:38.7518188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7518535Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.7518831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7519557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.7520304Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.7520934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.7521654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.7522348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.7522964Z kernel = self.compile( 2025-05-07T20:33:38.7523592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.7524481Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.7524901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7525141Z 2025-05-07T20:33:38.7525357Z self = 2025-05-07T20:33:38.7526485Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.7527926Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28d47d00>} 2025-05-07T20:33:38.7529330Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.7530391Z context = 2025-05-07T20:33:38.7530697Z 2025-05-07T20:33:38.7530871Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.7531414Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.7531912Z module_map=module_map) 2025-05-07T20:33:38.7532290Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.7532658Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.7532937Z E ^ 2025-05-07T20:33:38.7533416Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.7533889Z 2025-05-07T20:33:38.7534330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.7534864Z 2025-05-07T20:33:38.7534973Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.7535404Z self=, 2025-05-07T20:33:38.7535817Z T=128, 2025-05-07T20:33:38.7536013Z D=5120, 2025-05-07T20:33:38.7536220Z scale_ub=None, 2025-05-07T20:33:38.7536443Z contiguous=False, 2025-05-07T20:33:38.7536685Z compiled=True, 2025-05-07T20:33:38.7536907Z ) 2025-05-07T20:33:38.7537235Z self = 2025-05-07T20:33:38.7537828Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:38.7538115Z 2025-05-07T20:33:38.7538197Z @given( 2025-05-07T20:33:38.7538442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.7538767Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.7539089Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.7539440Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.7539787Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.7540090Z ) 2025-05-07T20:33:38.7540461Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.7540918Z def test_silu_mul_quant( 2025-05-07T20:33:38.7541172Z self, 2025-05-07T20:33:38.7541377Z T: int, 2025-05-07T20:33:38.7541585Z D: int, 2025-05-07T20:33:38.7541877Z scale_ub: Optional[float], 2025-05-07T20:33:38.7542164Z contiguous: bool, 2025-05-07T20:33:38.7542416Z compiled: bool, 2025-05-07T20:33:38.7542650Z ) -> None: 2025-05-07T20:33:38.7542878Z torch.manual_seed(2025) 2025-05-07T20:33:38.7543134Z 2025-05-07T20:33:38.7543414Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.7543841Z 2025-05-07T20:33:38.7544047Z x_sign = torch.sign(x) 2025-05-07T20:33:38.7544406Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.7544736Z x = x_sign * x_clamp 2025-05-07T20:33:38.7544990Z x0 = x[:, :D] 2025-05-07T20:33:38.7545214Z x1 = x[:, D:] 2025-05-07T20:33:38.7545432Z 2025-05-07T20:33:38.7545630Z if contiguous: 2025-05-07T20:33:38.7545870Z x0 = x0.contiguous() 2025-05-07T20:33:38.7546140Z x1 = x1.contiguous() 2025-05-07T20:33:38.7546395Z 2025-05-07T20:33:38.7546601Z if scale_ub is not None: 2025-05-07T20:33:38.7546885Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.7547238Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.7547566Z ) 2025-05-07T20:33:38.7547767Z else: 2025-05-07T20:33:38.7547988Z scale_ub_tensor = None 2025-05-07T20:33:38.7548254Z 2025-05-07T20:33:38.7548497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.7548833Z op = silu_mul_quant 2025-05-07T20:33:38.7549099Z if compiled: 2025-05-07T20:33:38.7549355Z op = torch.compile(op) 2025-05-07T20:33:38.7549665Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7549954Z 2025-05-07T20:33:38.7550153Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.7550333Z 2025-05-07T20:33:38.7550438Z moe/activation_test.py:117: 2025-05-07T20:33:38.7550750Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7551094Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.7551394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7551977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.7552561Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.7553243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.7554015Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.7554574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.7555286Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.7555973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.7556528Z kernel = self.compile( 2025-05-07T20:33:38.7557099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.7557827Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.7558244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7558491Z 2025-05-07T20:33:38.7558705Z self = 2025-05-07T20:33:38.7559828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.7561253Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28d44940>} 2025-05-07T20:33:38.7562652Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.7563822Z context = 2025-05-07T20:33:38.7564122Z 2025-05-07T20:33:38.7564301Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.7564849Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.7565439Z module_map=module_map) 2025-05-07T20:33:38.7565823Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.7566192Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.7566464Z E ^ 2025-05-07T20:33:38.7566952Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.7567424Z 2025-05-07T20:33:38.7567865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.7568401Z 2025-05-07T20:33:38.7568515Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.7568947Z self=, 2025-05-07T20:33:38.7569368Z T=128, 2025-05-07T20:33:38.7569569Z D=7168, 2025-05-07T20:33:38.7569767Z scale_ub=1200.0, 2025-05-07T20:33:38.7570009Z contiguous=False, 2025-05-07T20:33:38.7570250Z compiled=False, 2025-05-07T20:33:38.7570461Z ) 2025-05-07T20:33:38.8840213Z self = 2025-05-07T20:33:38.8840935Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:38.8841375Z 2025-05-07T20:33:38.8841507Z @given( 2025-05-07T20:33:38.8841860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.8842332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.8842784Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.8843242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.8843589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.8843890Z ) 2025-05-07T20:33:38.8844254Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.8844725Z def test_silu_mul_quant( 2025-05-07T20:33:38.8844986Z self, 2025-05-07T20:33:38.8845193Z T: int, 2025-05-07T20:33:38.8845403Z D: int, 2025-05-07T20:33:38.8845630Z scale_ub: Optional[float], 2025-05-07T20:33:38.8845919Z contiguous: bool, 2025-05-07T20:33:38.8846170Z compiled: bool, 2025-05-07T20:33:38.8846405Z ) -> None: 2025-05-07T20:33:38.8846635Z torch.manual_seed(2025) 2025-05-07T20:33:38.8846894Z 2025-05-07T20:33:38.8847183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.8847536Z 2025-05-07T20:33:38.8847747Z x_sign = torch.sign(x) 2025-05-07T20:33:38.8848055Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.8848375Z x = x_sign * x_clamp 2025-05-07T20:33:38.8848751Z x0 = x[:, :D] 2025-05-07T20:33:38.8848984Z x1 = x[:, D:] 2025-05-07T20:33:38.8849201Z 2025-05-07T20:33:38.8849401Z if contiguous: 2025-05-07T20:33:38.8849645Z x0 = x0.contiguous() 2025-05-07T20:33:38.8849914Z x1 = x1.contiguous() 2025-05-07T20:33:38.8850168Z 2025-05-07T20:33:38.8850373Z if scale_ub is not None: 2025-05-07T20:33:38.8850656Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.8851009Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.8851338Z ) 2025-05-07T20:33:38.8851544Z else: 2025-05-07T20:33:38.8851761Z scale_ub_tensor = None 2025-05-07T20:33:38.8852031Z 2025-05-07T20:33:38.8852275Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.8852670Z op = silu_mul_quant 2025-05-07T20:33:38.8852933Z if compiled: 2025-05-07T20:33:38.8853206Z op = torch.compile(op) 2025-05-07T20:33:38.8853517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.8853805Z 2025-05-07T20:33:38.8854012Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.8854187Z 2025-05-07T20:33:38.8854292Z moe/activation_test.py:117: 2025-05-07T20:33:38.8854727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.8855102Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.8855396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.8856114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.8856835Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.8857394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.8858111Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.8858799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.8859358Z kernel = self.compile( 2025-05-07T20:33:38.8859922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.8860613Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.8861023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.8861267Z 2025-05-07T20:33:38.8861483Z self = 2025-05-07T20:33:38.8862604Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.8864045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae289abf40>} 2025-05-07T20:33:38.8865441Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.8866516Z context = 2025-05-07T20:33:38.8866822Z 2025-05-07T20:33:38.8866995Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.8867539Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.8868024Z module_map=module_map) 2025-05-07T20:33:38.8868406Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.8868776Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.8869045Z E ^ 2025-05-07T20:33:38.8869579Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.8870056Z 2025-05-07T20:33:38.8870490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.8871027Z 2025-05-07T20:33:38.8871145Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.8871575Z self=, 2025-05-07T20:33:38.8871992Z T=128, 2025-05-07T20:33:38.8872189Z D=5120, 2025-05-07T20:33:38.8872394Z scale_ub=None, 2025-05-07T20:33:38.8872621Z contiguous=False, 2025-05-07T20:33:38.8872862Z compiled=False, 2025-05-07T20:33:38.8873081Z ) 2025-05-07T20:33:38.8873413Z self = 2025-05-07T20:33:38.8874065Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:38.8874345Z 2025-05-07T20:33:38.8874431Z @given( 2025-05-07T20:33:38.8874674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.8875004Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.8875325Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.8875667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.8876103Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.8876404Z ) 2025-05-07T20:33:38.8876774Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.8877230Z def test_silu_mul_quant( 2025-05-07T20:33:38.8877481Z self, 2025-05-07T20:33:38.8877687Z T: int, 2025-05-07T20:33:38.8877889Z D: int, 2025-05-07T20:33:38.8878119Z scale_ub: Optional[float], 2025-05-07T20:33:38.8878406Z contiguous: bool, 2025-05-07T20:33:38.8878660Z compiled: bool, 2025-05-07T20:33:38.8878895Z ) -> None: 2025-05-07T20:33:38.8879123Z torch.manual_seed(2025) 2025-05-07T20:33:38.8879375Z 2025-05-07T20:33:38.8879666Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.8880021Z 2025-05-07T20:33:38.8880227Z x_sign = torch.sign(x) 2025-05-07T20:33:38.8880532Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.8880860Z x = x_sign * x_clamp 2025-05-07T20:33:38.8881112Z x0 = x[:, :D] 2025-05-07T20:33:38.8881337Z x1 = x[:, D:] 2025-05-07T20:33:38.8881557Z 2025-05-07T20:33:38.8881749Z if contiguous: 2025-05-07T20:33:38.8881995Z x0 = x0.contiguous() 2025-05-07T20:33:38.8882270Z x1 = x1.contiguous() 2025-05-07T20:33:38.8882524Z 2025-05-07T20:33:38.8882727Z if scale_ub is not None: 2025-05-07T20:33:38.8883016Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.8883369Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.8883688Z ) 2025-05-07T20:33:38.8883890Z else: 2025-05-07T20:33:38.8884113Z scale_ub_tensor = None 2025-05-07T20:33:38.8884369Z 2025-05-07T20:33:38.8884613Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.8884942Z op = silu_mul_quant 2025-05-07T20:33:38.8885202Z if compiled: 2025-05-07T20:33:38.8885468Z op = torch.compile(op) 2025-05-07T20:33:38.8885782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.8886064Z 2025-05-07T20:33:38.8886271Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.8886450Z 2025-05-07T20:33:38.8886554Z moe/activation_test.py:117: 2025-05-07T20:33:38.8886863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.8887204Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.8887499Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.8888222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.8888996Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.8889560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.8890274Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.8890971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.8891524Z kernel = self.compile( 2025-05-07T20:33:38.8892091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.8892775Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.8893185Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.8893473Z 2025-05-07T20:33:38.8893689Z self = 2025-05-07T20:33:38.8894811Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.8896319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae289a95a0>} 2025-05-07T20:33:38.8897719Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.8898781Z context = 2025-05-07T20:33:38.8899087Z 2025-05-07T20:33:38.8899262Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.8899826Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.8900354Z module_map=module_map) 2025-05-07T20:33:38.8900732Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.8901098Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.8901371Z E ^ 2025-05-07T20:33:38.8901858Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.8902331Z 2025-05-07T20:33:38.8902763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.8903302Z 2025-05-07T20:33:38.8903410Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.8903842Z self=, 2025-05-07T20:33:38.8904262Z T=128, 2025-05-07T20:33:38.8904459Z D=5120, 2025-05-07T20:33:38.8904662Z scale_ub=1200.0, 2025-05-07T20:33:38.8904891Z contiguous=True, 2025-05-07T20:33:38.8905122Z compiled=False, 2025-05-07T20:33:38.8905337Z ) 2025-05-07T20:33:39.0843086Z self = 2025-05-07T20:33:39.0843613Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:39.0844060Z 2025-05-07T20:33:39.0844195Z @given( 2025-05-07T20:33:39.0844552Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:39.0844997Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:39.0845415Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:39.0845754Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:39.0846098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:39.0846400Z ) 2025-05-07T20:33:39.0846758Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:39.0847221Z def test_silu_mul_quant( 2025-05-07T20:33:39.0847473Z self, 2025-05-07T20:33:39.0847674Z T: int, 2025-05-07T20:33:39.0847993Z D: int, 2025-05-07T20:33:39.0848229Z scale_ub: Optional[float], 2025-05-07T20:33:39.0848510Z contiguous: bool, 2025-05-07T20:33:39.0848762Z compiled: bool, 2025-05-07T20:33:39.0848996Z ) -> None: 2025-05-07T20:33:39.0849218Z torch.manual_seed(2025) 2025-05-07T20:33:39.0849475Z 2025-05-07T20:33:39.0849767Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:39.0850121Z 2025-05-07T20:33:39.0850327Z x_sign = torch.sign(x) 2025-05-07T20:33:39.0856618Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:39.0856983Z x = x_sign * x_clamp 2025-05-07T20:33:39.0857233Z x0 = x[:, :D] 2025-05-07T20:33:39.0857459Z x1 = x[:, D:] 2025-05-07T20:33:39.0857675Z 2025-05-07T20:33:39.0857972Z if contiguous: 2025-05-07T20:33:39.0858217Z x0 = x0.contiguous() 2025-05-07T20:33:39.0858486Z x1 = x1.contiguous() 2025-05-07T20:33:39.0858733Z 2025-05-07T20:33:39.0858937Z if scale_ub is not None: 2025-05-07T20:33:39.0859219Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:39.0859563Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:39.0859985Z ) 2025-05-07T20:33:39.0860222Z else: 2025-05-07T20:33:39.0860515Z scale_ub_tensor = None 2025-05-07T20:33:39.0860779Z 2025-05-07T20:33:39.0861022Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:39.0861355Z op = silu_mul_quant 2025-05-07T20:33:39.0861610Z if compiled: 2025-05-07T20:33:39.0861873Z op = torch.compile(op) 2025-05-07T20:33:39.0862184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0862472Z 2025-05-07T20:33:39.0862675Z > y_fp8, y_scale = fn() 2025-05-07T20:33:39.0862854Z 2025-05-07T20:33:39.0862963Z moe/activation_test.py:117: 2025-05-07T20:33:39.0863274Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0863624Z moe/activation_test.py:115: in fn 2025-05-07T20:33:39.0863919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0864643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:39.0865364Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:39.0865920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:39.0866628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:39.0867309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:39.0867865Z kernel = self.compile( 2025-05-07T20:33:39.0868431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:39.0869119Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:39.0869526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0869769Z 2025-05-07T20:33:39.0869981Z self = 2025-05-07T20:33:39.0871108Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:39.0872535Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae289abd90>} 2025-05-07T20:33:39.0874001Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:39.0875123Z context = 2025-05-07T20:33:39.0875430Z 2025-05-07T20:33:39.0875605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:39.0876147Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:39.0876633Z module_map=module_map) 2025-05-07T20:33:39.0877017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:39.0877383Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:39.0877653Z E ^ 2025-05-07T20:33:39.0878130Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:39.0878605Z 2025-05-07T20:33:39.0879040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:39.0879613Z 2025-05-07T20:33:39.0879726Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:39.0880156Z self=, 2025-05-07T20:33:39.0880572Z T=1, 2025-05-07T20:33:39.0880763Z D=7168, 2025-05-07T20:33:39.0880967Z scale_ub=1200.0, 2025-05-07T20:33:39.0881243Z contiguous=True, 2025-05-07T20:33:39.0881515Z compiled=True, 2025-05-07T20:33:39.0881728Z ) 2025-05-07T20:33:39.0882054Z self = 2025-05-07T20:33:39.0882558Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:39.0882825Z 2025-05-07T20:33:39.0882910Z @given( 2025-05-07T20:33:39.0883147Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:39.0883474Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:39.0883796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:39.0884136Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:39.0884481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:39.0884782Z ) 2025-05-07T20:33:39.0885146Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:39.0885600Z def test_silu_mul_quant( 2025-05-07T20:33:39.0885855Z self, 2025-05-07T20:33:39.0886060Z T: int, 2025-05-07T20:33:39.0886266Z D: int, 2025-05-07T20:33:39.0886493Z scale_ub: Optional[float], 2025-05-07T20:33:39.0886777Z contiguous: bool, 2025-05-07T20:33:39.0887023Z compiled: bool, 2025-05-07T20:33:39.0887256Z ) -> None: 2025-05-07T20:33:39.0887480Z torch.manual_seed(2025) 2025-05-07T20:33:39.0887727Z 2025-05-07T20:33:39.0888011Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:39.0888367Z 2025-05-07T20:33:39.0888564Z x_sign = torch.sign(x) 2025-05-07T20:33:39.0888866Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:39.0889185Z x = x_sign * x_clamp 2025-05-07T20:33:39.0889464Z x0 = x[:, :D] 2025-05-07T20:33:39.0889709Z x1 = x[:, D:] 2025-05-07T20:33:39.0889925Z 2025-05-07T20:33:39.0890118Z if contiguous: 2025-05-07T20:33:39.0890354Z x0 = x0.contiguous() 2025-05-07T20:33:39.0890625Z x1 = x1.contiguous() 2025-05-07T20:33:39.0890875Z 2025-05-07T20:33:39.0891073Z if scale_ub is not None: 2025-05-07T20:33:39.0891362Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:39.0891708Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:39.0892026Z ) 2025-05-07T20:33:39.0892226Z else: 2025-05-07T20:33:39.0892447Z scale_ub_tensor = None 2025-05-07T20:33:39.0892708Z 2025-05-07T20:33:39.0892950Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:39.0893279Z op = silu_mul_quant 2025-05-07T20:33:39.0893535Z if compiled: 2025-05-07T20:33:39.0893791Z op = torch.compile(op) 2025-05-07T20:33:39.0894148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0894430Z 2025-05-07T20:33:39.0894635Z > y_fp8, y_scale = fn() 2025-05-07T20:33:39.0894809Z 2025-05-07T20:33:39.0894928Z moe/activation_test.py:117: 2025-05-07T20:33:39.0895239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0895577Z moe/activation_test.py:115: in fn 2025-05-07T20:33:39.0895870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0896450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:39.0897037Z return fn(*args, **kwargs) 
2025-05-07T20:33:39.0897715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:39.0898471Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:39.0899030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:39.0899751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:39.0900478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:39.0901117Z kernel = self.compile( 2025-05-07T20:33:39.0901680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:39.0902357Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:39.0902766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0903006Z 2025-05-07T20:33:39.0903226Z self = 2025-05-07T20:33:39.0904346Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:39.0905763Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae289ab1c0>} 2025-05-07T20:33:39.0907157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:39.0908221Z context = 2025-05-07T20:33:39.0908519Z 2025-05-07T20:33:39.0908696Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:39.0909232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:39.0909725Z module_map=module_map) 2025-05-07T20:33:39.0910105Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:39.0910471Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:39.0910739Z E ^ 2025-05-07T20:33:39.0911220Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:39.0912127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:39.0912766Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Same test body and traceback as the example above: _fbgemm_silu_mul_quant fails to compile with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
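Every failing example above bottoms out in the same ValueError: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype for this GPU and offers only fp8e4b15 and fp8e5. Triton enables fp8e4nv only on NVIDIA parts with compute capability 8.9 or newer (Ada/Hopper), so any kernel that casts to torch.float8_e4m3fn fails at compile time on the SM 8.x device running this job. A minimal probe, assuming that 8.9 cutoff, would look like:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv is Triton's name for FP8 E4M3 (torch.float8_e4m3fn);
        # Triton only compiles it for compute capability 8.9+ (Ada, Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    print(supports_fp8e4nv())  # False on the GPU that produced this log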
2025-05-07T20:33:39.2332478Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
This example fails one step later: fn() returns, and the reference path raises instead, at moe/activation_test.py:126 in ref_fn, via triton_quantize_fp8_row (fp8_gemm.py:2370) into _kernel_quantize_fp8_row, which the Triton autotuner benchmarks and which hits the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
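For context, ref_fn in the test above pins down the expected semantics: SiLU(x0) * x1 computed in fp32, then row-wise FP8 quantization. A hedged, PyTorch-only restatement (the helper names are hypothetical, and quantize_fp8_row_ref is only an outline of the row-wise scaling that triton_quantize_fp8_row performs, not FBGEMM's actual kernel):

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in fp32, exactly as ref_fn computes y above.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub=None):
        # One scale per row, chosen so max(|row|) maps onto the FP8 maximum.
        # Dequantization is y_fp8.to(torch.float32) * y_scale[:, None],
        # matching how the test reconstructs y from the quantized pair.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = fp8_max / row_max
        y_fp8 = (y * scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, 1.0 / scale

Note that the eager .to(torch.float8_e4m3fn) cast works on this GPU; it is only Triton's compiled kernels that reject the dtype, which is why both the op under test and the Triton-based reference quantizer fail here.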
2025-05-07T20:33:39.4972984Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:39.6701380Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:39.6734194Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:39.7772374Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
All four examples repeat the identical test body and traceback: _fbgemm_silu_mul_quant fails to compile at y_fp8, y_scale = fn() with CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
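Because the drawn parameters never affect the outcome (every example dies at kernel-compile time), a suite like this can guard itself on FP8 support rather than erroring once per example. A hedged sketch, not the repository's actual fix:

    import unittest
    import torch

    def _fp8e4nv_supported() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator; placed above the @given/@settings stack on
    # test_silu_mul_quant it skips before Hypothesis draws any examples.
    requires_fp8e4nv = unittest.skipIf(
        not _fp8e4nv_supported(),
        "Triton fp8e4nv requires compute capability 8.9+ (Ada/Hopper)",
    )

Applied as @requires_fp8e4nv outermost, the skip fires before the Hypothesis wrapper runs, so the run reports one skipped test instead of this stream of CompilationErrors.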
2025-05-07T20:33:39.9101903Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:39.9139205Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:39.9177570Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:40.1165749Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Each of these examples fails identically at y_fp8, y_scale = fn(): _fbgemm_silu_mul_quant raises CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
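The ValueError itself names the two FP8 formats this architecture can compile, fp8e4b15 and fp8e5. A hedged sketch of capability-based dtype selection (mapping assumption: Triton's fp8e4nv corresponds to torch.float8_e4m3fn and fp8e5 to torch.float8_e5m2; whether the kernels under test accept an alternate dtype is not shown in this log):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # E4M3 (fp8e4nv) has more mantissa bits but needs SM 8.9+;
        # E5M2 (fp8e5) trades precision for range and compiles on SM 8.0/8.6.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2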
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:40.1198635Z 
2025-05-07T20:33:40.1199076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
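Every example hypothesis generates here fails the same way: Triton only lowers the fp8e4nv (FP8 E4M3) dtype on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G on this linux.g5.4xlarge runner reports capability (8, 6), where only fp8e4b15 and fp8e5 are available. A minimal guard sketch follows; the helper name, the test-class name, and the skipIf wiring are illustrative assumptions, not FBGEMM's actual test code.

import unittest

import torch

def _supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (E4M3) codegen needs SM 8.9+
    # (Ada/Hopper); the A10G on this runner is SM 8.6, so this returns False.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class Fp8ActivationTests(unittest.TestCase):
    ...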
2025-05-07T20:33:40.4480942Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:40.4537314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.4538465Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:40.4602562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.5871479Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:40.5927301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.5928450Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:40.5980025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.5981116Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:40.8730560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.8731689Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:40.8785654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.8786802Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:40.9854461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.9855344Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:40.9906839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:41.3314072Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:41.3369595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:41.3370797Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:41.3425960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:41.5381160Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:41.5451368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.5450581Z 2025-05-07T20:33:41.5451368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.5452351Z 2025-05-07T20:33:41.5452543Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.5453324Z self=, 2025-05-07T20:33:41.5454288Z T=2048, 2025-05-07T20:33:41.5454635Z D=5120, 2025-05-07T20:33:41.5454987Z scale_ub=None, 2025-05-07T20:33:41.5455400Z contiguous=False, 2025-05-07T20:33:41.5455807Z compiled=True, 2025-05-07T20:33:41.5456196Z ) 2025-05-07T20:33:41.6567969Z self = 2025-05-07T20:33:41.6569407Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:41.6570122Z 2025-05-07T20:33:41.6570309Z @given( 2025-05-07T20:33:41.6570726Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.6571309Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.6571821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.6572430Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.6573063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.6573609Z ) 2025-05-07T20:33:41.6574299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.6575173Z def test_silu_mul_quant( 2025-05-07T20:33:41.6575629Z self, 2025-05-07T20:33:41.6576007Z T: int, 2025-05-07T20:33:41.6576385Z D: int, 2025-05-07T20:33:41.6576790Z scale_ub: Optional[float], 2025-05-07T20:33:41.6577314Z contiguous: bool, 2025-05-07T20:33:41.6577785Z compiled: bool, 2025-05-07T20:33:41.6578206Z ) -> None: 2025-05-07T20:33:41.6578624Z torch.manual_seed(2025) 2025-05-07T20:33:41.6579086Z 2025-05-07T20:33:41.6579600Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.6580249Z 2025-05-07T20:33:41.6580618Z x_sign = torch.sign(x) 2025-05-07T20:33:41.6581170Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.6581756Z x = x_sign * x_clamp 2025-05-07T20:33:41.6582215Z x0 = x[:, :D] 2025-05-07T20:33:41.6582633Z x1 = x[:, D:] 2025-05-07T20:33:41.6583021Z 2025-05-07T20:33:41.6583374Z if contiguous: 2025-05-07T20:33:41.6583815Z x0 = x0.contiguous() 2025-05-07T20:33:41.6584311Z x1 = x1.contiguous() 2025-05-07T20:33:41.6584775Z 2025-05-07T20:33:41.6585140Z if scale_ub is not None: 2025-05-07T20:33:41.6585651Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.6586292Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.6586890Z ) 2025-05-07T20:33:41.6587249Z else: 2025-05-07T20:33:41.6587650Z scale_ub_tensor = None 2025-05-07T20:33:41.6588134Z 2025-05-07T20:33:41.6588560Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.6589168Z op = silu_mul_quant 2025-05-07T20:33:41.6589642Z if compiled: 2025-05-07T20:33:41.6590108Z op = torch.compile(op) 2025-05-07T20:33:41.6590662Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.6591194Z 2025-05-07T20:33:41.6591558Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.6591877Z 2025-05-07T20:33:41.6592063Z moe/activation_test.py:117: 2025-05-07T20:33:41.6592784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.6593429Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.6594079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.6595175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.6596236Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.6597477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.6598783Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.6599813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.6601267Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.6602570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.6603638Z kernel = self.compile( 2025-05-07T20:33:41.6604712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.6606130Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.6606967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.6607435Z 2025-05-07T20:33:41.6607831Z self = 2025-05-07T20:33:41.6609996Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.6612799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad074543a0>} 2025-05-07T20:33:41.6615519Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.6617499Z context = 2025-05-07T20:33:41.6618087Z 2025-05-07T20:33:41.6618407Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.6619438Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.6620354Z module_map=module_map) 2025-05-07T20:33:41.6621054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.6621732Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.6622241Z E ^ 2025-05-07T20:33:41.6623148Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.6624380Z 2025-05-07T20:33:41.6625216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.6626237Z 2025-05-07T20:33:41.6626453Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.6627263Z self=, 2025-05-07T20:33:41.6628047Z T=2048, 2025-05-07T20:33:41.6628406Z D=5120, 2025-05-07T20:33:41.6628772Z scale_ub=1200.0, 2025-05-07T20:33:41.6629202Z contiguous=False, 2025-05-07T20:33:41.6629630Z compiled=True, 2025-05-07T20:33:41.6630022Z ) 2025-05-07T20:33:41.6630631Z self = 2025-05-07T20:33:41.6631605Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:41.6632157Z 2025-05-07T20:33:41.6632317Z @given( 2025-05-07T20:33:41.6632749Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.6633488Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.6634201Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.6634848Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.6635486Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.6636055Z ) 2025-05-07T20:33:41.6636743Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.6637609Z def test_silu_mul_quant( 2025-05-07T20:33:41.6638053Z self, 2025-05-07T20:33:41.6638379Z T: int, 2025-05-07T20:33:41.6638678Z D: int, 2025-05-07T20:33:41.6639021Z scale_ub: Optional[float], 2025-05-07T20:33:41.6639448Z contiguous: bool, 2025-05-07T20:33:41.6639815Z compiled: bool, 2025-05-07T20:33:41.6640310Z ) -> None: 2025-05-07T20:33:41.6640654Z torch.manual_seed(2025) 2025-05-07T20:33:41.6641032Z 2025-05-07T20:33:41.6641463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.6641993Z 2025-05-07T20:33:41.6642288Z x_sign = torch.sign(x) 2025-05-07T20:33:41.6642752Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.6643431Z x = x_sign * x_clamp 2025-05-07T20:33:41.6643838Z x0 = x[:, :D] 2025-05-07T20:33:41.6644304Z x1 = x[:, D:] 2025-05-07T20:33:41.6644675Z 2025-05-07T20:33:41.6644994Z if contiguous: 2025-05-07T20:33:41.6645389Z x0 = x0.contiguous() 2025-05-07T20:33:41.6645839Z x1 = x1.contiguous() 2025-05-07T20:33:41.6646247Z 2025-05-07T20:33:41.6646557Z if scale_ub is not None: 2025-05-07T20:33:41.6647006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.6647571Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.6648100Z ) 2025-05-07T20:33:41.6648447Z else: 2025-05-07T20:33:41.6648813Z scale_ub_tensor = None 2025-05-07T20:33:41.6649233Z 2025-05-07T20:33:41.6649640Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.6650175Z op = silu_mul_quant 2025-05-07T20:33:41.6650608Z if compiled: 2025-05-07T20:33:41.6651080Z op = torch.compile(op) 2025-05-07T20:33:41.6651589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.6652074Z 2025-05-07T20:33:41.6652423Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.6652731Z 2025-05-07T20:33:41.6652914Z moe/activation_test.py:117: 2025-05-07T20:33:41.6653456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.6654059Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.6654576Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.6655606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.6656634Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.6657860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.6659135Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.6660121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.6661372Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.6662602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.6663581Z kernel = self.compile( 2025-05-07T20:33:41.6664529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.6665733Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.6666463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.6666890Z 2025-05-07T20:33:41.6667395Z self = 2025-05-07T20:33:41.6669505Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.6672167Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07454820>} 2025-05-07T20:33:41.6674819Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.6677005Z context = 2025-05-07T20:33:41.6677574Z 2025-05-07T20:33:41.6677907Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.6678930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.6679868Z module_map=module_map) 2025-05-07T20:33:41.6680734Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.6681463Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.6681972Z E ^ 2025-05-07T20:33:41.6682896Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.6683808Z 2025-05-07T20:33:41.6684647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.6685677Z 2025-05-07T20:33:42.0512894Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.0513918Z self=, 2025-05-07T20:33:42.0514680Z T=4096, 2025-05-07T20:33:42.0515029Z D=5120, 2025-05-07T20:33:42.0515395Z scale_ub=1200.0, 2025-05-07T20:33:42.0515816Z contiguous=True, 2025-05-07T20:33:42.0516210Z compiled=True, 2025-05-07T20:33:42.0516594Z ) 2025-05-07T20:33:42.0517181Z self = 2025-05-07T20:33:42.0518123Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:42.0518625Z 2025-05-07T20:33:42.0518757Z @given( 2025-05-07T20:33:42.0519126Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.0519637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.0520176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.0520875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.0521556Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.0522056Z ) 2025-05-07T20:33:42.0522681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.0523458Z def test_silu_mul_quant( 2025-05-07T20:33:42.0524106Z self, 2025-05-07T20:33:42.0524448Z T: int, 2025-05-07T20:33:42.0524786Z D: int, 2025-05-07T20:33:42.0525144Z scale_ub: Optional[float], 2025-05-07T20:33:42.0525635Z contiguous: bool, 2025-05-07T20:33:42.0526060Z compiled: bool, 2025-05-07T20:33:42.0526449Z ) -> None: 2025-05-07T20:33:42.0526829Z torch.manual_seed(2025) 2025-05-07T20:33:42.0527262Z 2025-05-07T20:33:42.0527782Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.0528411Z 2025-05-07T20:33:42.0528760Z x_sign = torch.sign(x) 2025-05-07T20:33:42.0529290Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.0529868Z x = x_sign * x_clamp 2025-05-07T20:33:42.0530331Z x0 = x[:, :D] 2025-05-07T20:33:42.0530731Z x1 = x[:, D:] 2025-05-07T20:33:42.0531112Z 2025-05-07T20:33:42.0531469Z if contiguous: 2025-05-07T20:33:42.0532226Z x0 = x0.contiguous() 2025-05-07T20:33:42.0532728Z x1 = x1.contiguous() 2025-05-07T20:33:42.0533192Z 2025-05-07T20:33:42.0533561Z if scale_ub is not None: 2025-05-07T20:33:42.0534077Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.0534720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.0535312Z ) 2025-05-07T20:33:42.0535673Z else: 2025-05-07T20:33:42.0536073Z scale_ub_tensor = None 2025-05-07T20:33:42.0536557Z 2025-05-07T20:33:42.0536992Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.0537605Z op = silu_mul_quant 2025-05-07T20:33:42.0538085Z if compiled: 2025-05-07T20:33:42.0538561Z op = torch.compile(op) 2025-05-07T20:33:42.0539286Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.0539817Z 2025-05-07T20:33:42.0540186Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.0540504Z 2025-05-07T20:33:42.0540698Z moe/activation_test.py:117: 2025-05-07T20:33:42.0541263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.0541906Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.0542561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.0543743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:42.0544823Z return fn(*args, **kwargs) 
2025-05-07T20:33:42.0546105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.0547457Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.0548503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.0549856Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.0551162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.0552230Z kernel = self.compile( 2025-05-07T20:33:42.0553252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.0554590Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.0555316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.0555712Z 2025-05-07T20:33:42.0556097Z self = 2025-05-07T20:33:42.0558145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.0560829Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07455360>} 2025-05-07T20:33:42.0563454Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.0565466Z context = 2025-05-07T20:33:42.0566029Z 2025-05-07T20:33:42.0566345Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.0567337Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.0568232Z module_map=module_map) 2025-05-07T20:33:42.0568921Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.0569596Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.0570092Z E ^ 2025-05-07T20:33:42.0571085Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.0571979Z 2025-05-07T20:33:42.0572791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.0573787Z 2025-05-07T20:33:42.0574000Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.0574786Z self=, 2025-05-07T20:33:42.0575542Z T=128, 2025-05-07T20:33:42.0575899Z D=5120, 2025-05-07T20:33:42.0576261Z scale_ub=1200.0, 2025-05-07T20:33:42.0576675Z contiguous=False, 2025-05-07T20:33:42.0577101Z compiled=True, 2025-05-07T20:33:42.0577487Z ) 2025-05-07T20:33:42.1818258Z self = 2025-05-07T20:33:42.1819130Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:42.1819585Z 2025-05-07T20:33:42.1819691Z @given( 2025-05-07T20:33:42.1819966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.1820604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.1821219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.1822038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.1822813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.1823395Z ) 2025-05-07T20:33:42.1824423Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.1825310Z def test_silu_mul_quant( 2025-05-07T20:33:42.1825796Z self, 2025-05-07T20:33:42.1826182Z T: int, 2025-05-07T20:33:42.1826579Z D: int, 2025-05-07T20:33:42.1827015Z scale_ub: Optional[float], 2025-05-07T20:33:42.1827552Z contiguous: bool, 2025-05-07T20:33:42.1828047Z compiled: bool, 2025-05-07T20:33:42.1828500Z ) -> None: 2025-05-07T20:33:42.1828935Z torch.manual_seed(2025) 2025-05-07T20:33:42.1829416Z 2025-05-07T20:33:42.1829968Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.1830654Z 2025-05-07T20:33:42.1830954Z x_sign = torch.sign(x) 2025-05-07T20:33:42.1831335Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.1831710Z x = x_sign * x_clamp 2025-05-07T20:33:42.1831990Z x0 = x[:, :D] 2025-05-07T20:33:42.1832245Z x1 = x[:, D:] 2025-05-07T20:33:42.1832486Z 2025-05-07T20:33:42.1832698Z if contiguous: 2025-05-07T20:33:42.1832970Z x0 = x0.contiguous() 2025-05-07T20:33:42.1833273Z x1 = x1.contiguous() 2025-05-07T20:33:42.1833609Z 2025-05-07T20:33:42.1833837Z if scale_ub is not None: 2025-05-07T20:33:42.1834157Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.1834543Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.1834905Z ) 2025-05-07T20:33:42.1835134Z else: 2025-05-07T20:33:42.1835383Z scale_ub_tensor = None 2025-05-07T20:33:42.1835679Z 2025-05-07T20:33:42.1835956Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.1836321Z op = silu_mul_quant 2025-05-07T20:33:42.1836612Z if compiled: 2025-05-07T20:33:42.1836904Z op = torch.compile(op) 2025-05-07T20:33:42.1837253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1837574Z 2025-05-07T20:33:42.1837831Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.1838029Z 2025-05-07T20:33:42.1838146Z moe/activation_test.py:117: 2025-05-07T20:33:42.1838494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1838878Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.1839212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1839871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:42.1840614Z return fn(*args, **kwargs) 
2025-05-07T20:33:42.1841378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.1842178Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.1842807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.1843586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.1844351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.1844968Z kernel = self.compile( 2025-05-07T20:33:42.1845598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.1846422Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.1846889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1847156Z 2025-05-07T20:33:42.1847403Z self = 2025-05-07T20:33:42.1848713Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.1850367Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07456290>} 2025-05-07T20:33:42.1851963Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.1853147Z context = 2025-05-07T20:33:42.1853481Z 2025-05-07T20:33:42.1853685Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.1854282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.1854832Z module_map=module_map) 2025-05-07T20:33:42.1855259Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.1855670Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.1855965Z E ^ 2025-05-07T20:33:42.1856506Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.1857028Z 2025-05-07T20:33:42.1857515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.1858106Z 2025-05-07T20:33:42.1858226Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.1858707Z self=, 2025-05-07T20:33:42.1859172Z T=16384, 2025-05-07T20:33:42.1859399Z D=7168, 2025-05-07T20:33:42.1859636Z scale_ub=1200.0, 2025-05-07T20:33:42.1859890Z contiguous=True, 2025-05-07T20:33:42.1860147Z compiled=True, 2025-05-07T20:33:42.1866836Z ) 2025-05-07T20:33:42.1867243Z self = 2025-05-07T20:33:42.1867824Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:42.1868145Z 2025-05-07T20:33:42.1868247Z @given( 2025-05-07T20:33:42.1868519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.1868890Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.1869250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.1869629Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.1870013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.1870352Z ) 2025-05-07T20:33:42.1870836Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.1871402Z def test_silu_mul_quant( 2025-05-07T20:33:42.1871690Z self, 2025-05-07T20:33:42.1871915Z T: int, 2025-05-07T20:33:42.1872154Z D: int, 2025-05-07T20:33:42.1872419Z scale_ub: Optional[float], 2025-05-07T20:33:42.1872744Z contiguous: bool, 2025-05-07T20:33:42.1873025Z compiled: bool, 2025-05-07T20:33:42.1873290Z ) -> None: 2025-05-07T20:33:42.1873620Z torch.manual_seed(2025) 2025-05-07T20:33:42.1873901Z 2025-05-07T20:33:42.1874221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.1874618Z 2025-05-07T20:33:42.1874841Z x_sign = torch.sign(x) 2025-05-07T20:33:42.1875184Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.1875607Z x = x_sign * x_clamp 2025-05-07T20:33:42.1875885Z x0 = x[:, :D] 2025-05-07T20:33:42.1876144Z x1 = x[:, D:] 2025-05-07T20:33:42.1876389Z 2025-05-07T20:33:42.1876611Z if contiguous: 2025-05-07T20:33:42.1876884Z x0 = x0.contiguous() 2025-05-07T20:33:42.1877186Z x1 = x1.contiguous() 2025-05-07T20:33:42.1877458Z 2025-05-07T20:33:42.1877743Z if scale_ub is not None: 2025-05-07T20:33:42.1878112Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.1878502Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.1878865Z ) 2025-05-07T20:33:42.1879094Z else: 2025-05-07T20:33:42.1879342Z scale_ub_tensor = None 2025-05-07T20:33:42.1879631Z 2025-05-07T20:33:42.1879908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.1880283Z op = silu_mul_quant 2025-05-07T20:33:42.1880570Z if compiled: 2025-05-07T20:33:42.1880865Z op = torch.compile(op) 2025-05-07T20:33:42.1881218Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1881533Z 2025-05-07T20:33:42.1881763Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.1881953Z 2025-05-07T20:33:42.1882071Z moe/activation_test.py:117: 2025-05-07T20:33:42.1882420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1882808Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.1883136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1883781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:42.1884428Z return fn(*args, **kwargs) 
2025-05-07T20:33:42.1885188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.1885972Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.1886591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.1887374Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.1888133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.1888740Z kernel = self.compile( 2025-05-07T20:33:42.1889370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.1890123Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.1890580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1890851Z 2025-05-07T20:33:42.1891090Z self = 2025-05-07T20:33:42.1892330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.1893957Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07456d40>} 2025-05-07T20:33:42.1895490Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.1896655Z context = 2025-05-07T20:33:42.1896994Z 2025-05-07T20:33:42.1897186Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.1897788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.1898329Z module_map=module_map) 2025-05-07T20:33:42.1898810Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.1899216Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.1899515Z E ^ 2025-05-07T20:33:42.1900047Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.1900568Z 2025-05-07T20:33:42.1901125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.1901766Z 2025-05-07T20:33:42.3390760Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.3392249Z self=, 2025-05-07T20:33:42.3393141Z T=16384, 2025-05-07T20:33:42.3393654Z D=5120, 2025-05-07T20:33:42.3394027Z scale_ub=1200.0, 2025-05-07T20:33:42.3394435Z contiguous=True, 2025-05-07T20:33:42.3394847Z compiled=False, 2025-05-07T20:33:42.3395200Z ) 2025-05-07T20:33:42.3395823Z self = 2025-05-07T20:33:42.3396797Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:42.3397368Z 2025-05-07T20:33:42.3397527Z @given( 2025-05-07T20:33:42.3397964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.3398582Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.3399198Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.3399845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.3400493Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.3401057Z ) 2025-05-07T20:33:42.3401737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.3402634Z def test_silu_mul_quant( 2025-05-07T20:33:42.3403112Z self, 2025-05-07T20:33:42.3403498Z T: int, 2025-05-07T20:33:42.3403871Z D: int, 2025-05-07T20:33:42.3404295Z scale_ub: Optional[float], 2025-05-07T20:33:42.3404829Z contiguous: bool, 2025-05-07T20:33:42.3405285Z compiled: bool, 2025-05-07T20:33:42.3405723Z ) -> None: 2025-05-07T20:33:42.3406147Z torch.manual_seed(2025) 2025-05-07T20:33:42.3406612Z 2025-05-07T20:33:42.3407143Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.3407816Z 2025-05-07T20:33:42.3408181Z x_sign = torch.sign(x) 2025-05-07T20:33:42.3408748Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.3409359Z x = x_sign * x_clamp 2025-05-07T20:33:42.3409818Z x0 = x[:, :D] 2025-05-07T20:33:42.3410256Z x1 = x[:, D:] 2025-05-07T20:33:42.3410667Z 2025-05-07T20:33:42.3411012Z if contiguous: 2025-05-07T20:33:42.3411457Z x0 = x0.contiguous() 2025-05-07T20:33:42.3411959Z x1 = x1.contiguous() 2025-05-07T20:33:42.3412426Z 2025-05-07T20:33:42.3412784Z if scale_ub is not None: 2025-05-07T20:33:42.3413318Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.3413955Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.3414545Z ) 2025-05-07T20:33:42.3415247Z else: 2025-05-07T20:33:42.3415670Z scale_ub_tensor = None 2025-05-07T20:33:42.3416150Z 2025-05-07T20:33:42.3416589Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.3417205Z op = silu_mul_quant 2025-05-07T20:33:42.3417680Z if compiled: 2025-05-07T20:33:42.3418155Z op = torch.compile(op) 2025-05-07T20:33:42.3418732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.3419231Z 2025-05-07T20:33:42.3419589Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.3419900Z 2025-05-07T20:33:42.3420092Z moe/activation_test.py:117: 2025-05-07T20:33:42.3420652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.3421286Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.3421974Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.3423305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:42.3424882Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.3425914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.3427555Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.3428905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.3429966Z kernel = self.compile( 2025-05-07T20:33:42.3431045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.3432355Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.3433136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.3433709Z 2025-05-07T20:33:42.3434118Z self = 2025-05-07T20:33:42.3436305Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.3439097Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07457ac0>} 2025-05-07T20:33:42.3441832Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.3443885Z context = 2025-05-07T20:33:42.3444474Z 2025-05-07T20:33:42.3444795Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.3445840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.3446771Z module_map=module_map) 2025-05-07T20:33:42.3447481Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.3448173Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.3448680Z E ^ 2025-05-07T20:33:42.3449598Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.3450569Z 2025-05-07T20:33:42.3451400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.3452446Z 2025-05-07T20:33:42.3452645Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.3453457Z self=, 2025-05-07T20:33:42.3454250Z T=1, 2025-05-07T20:33:42.3454600Z D=7168, 2025-05-07T20:33:42.3454974Z scale_ub=1200.0, 2025-05-07T20:33:42.3455523Z contiguous=False, 2025-05-07T20:33:42.3455966Z compiled=False, 2025-05-07T20:33:42.3456369Z ) 2025-05-07T20:33:42.3456976Z self = 2025-05-07T20:33:42.3457948Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:42.3458483Z 2025-05-07T20:33:42.3458639Z @given( 2025-05-07T20:33:42.3459069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.3459681Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.3460273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.3460922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.3461546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.3462193Z ) 2025-05-07T20:33:42.3462751Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.3463437Z def test_silu_mul_quant( 2025-05-07T20:33:42.3463831Z self, 2025-05-07T20:33:42.3464151Z T: int, 2025-05-07T20:33:42.3464464Z D: int, 2025-05-07T20:33:42.3464822Z scale_ub: Optional[float], 2025-05-07T20:33:42.3465256Z contiguous: bool, 2025-05-07T20:33:42.3465733Z compiled: bool, 2025-05-07T20:33:42.3466154Z ) -> None: 2025-05-07T20:33:42.3466537Z torch.manual_seed(2025) 2025-05-07T20:33:42.3466954Z 2025-05-07T20:33:42.3467426Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.3467995Z 2025-05-07T20:33:42.3468309Z x_sign = torch.sign(x) 2025-05-07T20:33:42.3468808Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.3469345Z x = x_sign * x_clamp 2025-05-07T20:33:42.3469763Z x0 = x[:, :D] 2025-05-07T20:33:42.3470132Z x1 = x[:, D:] 2025-05-07T20:33:42.3470496Z 2025-05-07T20:33:42.3470820Z if contiguous: 2025-05-07T20:33:42.3471216Z x0 = x0.contiguous() 2025-05-07T20:33:42.3471676Z x1 = x1.contiguous() 2025-05-07T20:33:42.3472093Z 2025-05-07T20:33:42.3472412Z if scale_ub is not None: 2025-05-07T20:33:42.3472871Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.3473468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.3474086Z ) 2025-05-07T20:33:42.3474433Z else: 2025-05-07T20:33:42.3474806Z scale_ub_tensor = None 2025-05-07T20:33:42.3475282Z 2025-05-07T20:33:42.3475717Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.3476307Z op = silu_mul_quant 2025-05-07T20:33:42.3476769Z if compiled: 2025-05-07T20:33:42.3477228Z op = torch.compile(op) 2025-05-07T20:33:42.3477779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.3478300Z 2025-05-07T20:33:42.3478651Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.3478967Z 2025-05-07T20:33:42.3479152Z moe/activation_test.py:117: 2025-05-07T20:33:42.3479709Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.3480317Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.3480892Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.3482191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.3483479Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.3484476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.3485751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.3486989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.3487982Z kernel = self.compile( 2025-05-07T20:33:42.3489136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.3490387Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.3491084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.3491498Z 2025-05-07T20:33:42.3491883Z self = 2025-05-07T20:33:42.3493876Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.3496582Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06fa8550>} 2025-05-07T20:33:42.3499271Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.3501262Z context = 2025-05-07T20:33:42.3501843Z 2025-05-07T20:33:42.3502249Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.3503345Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.3504284Z module_map=module_map) 2025-05-07T20:33:42.3504980Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.3505674Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.3506185Z E ^ 2025-05-07T20:33:42.3507096Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.3508020Z 2025-05-07T20:33:42.3508853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.3509895Z 2025-05-07T20:33:42.5601845Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.5602741Z self=, 2025-05-07T20:33:42.5603541Z T=4096, 2025-05-07T20:33:42.5603896Z D=7168, 2025-05-07T20:33:42.5604270Z scale_ub=1200.0, 2025-05-07T20:33:42.5604689Z contiguous=False, 2025-05-07T20:33:42.5605121Z compiled=True, 2025-05-07T20:33:42.5605524Z ) 2025-05-07T20:33:42.5606129Z self = 2025-05-07T20:33:42.5607066Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:42.5607539Z 2025-05-07T20:33:42.5607670Z @given( 2025-05-07T20:33:42.5608055Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.5608589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.5609108Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.5609690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.5610290Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.5610835Z ) 2025-05-07T20:33:42.5611459Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.5612227Z def test_silu_mul_quant( 2025-05-07T20:33:42.5612654Z self, 2025-05-07T20:33:42.5612998Z T: int, 2025-05-07T20:33:42.5613361Z D: int, 2025-05-07T20:33:42.5613760Z scale_ub: Optional[float], 2025-05-07T20:33:42.5614257Z contiguous: bool, 2025-05-07T20:33:42.5614697Z compiled: bool, 2025-05-07T20:33:42.5615105Z ) -> None: 2025-05-07T20:33:42.5615499Z torch.manual_seed(2025) 2025-05-07T20:33:42.5615975Z 2025-05-07T20:33:42.5616490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.5617127Z 2025-05-07T20:33:42.5617486Z x_sign = torch.sign(x) 2025-05-07T20:33:42.5618379Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.5618973Z x = x_sign * x_clamp 2025-05-07T20:33:42.5619430Z x0 = x[:, :D] 2025-05-07T20:33:42.5619846Z x1 = x[:, D:] 2025-05-07T20:33:42.5620237Z 2025-05-07T20:33:42.5620645Z if contiguous: 2025-05-07T20:33:42.5621128Z x0 = x0.contiguous() 2025-05-07T20:33:42.5621628Z x1 = x1.contiguous() 2025-05-07T20:33:42.5622107Z 2025-05-07T20:33:42.5622490Z if scale_ub is not None: 2025-05-07T20:33:42.5623015Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.5623668Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.5624685Z ) 2025-05-07T20:33:42.5625065Z else: 2025-05-07T20:33:42.5625469Z scale_ub_tensor = None 2025-05-07T20:33:42.5626150Z 2025-05-07T20:33:42.5626611Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.5627233Z op = silu_mul_quant 2025-05-07T20:33:42.5627743Z if compiled: 2025-05-07T20:33:42.5628224Z op = torch.compile(op) 2025-05-07T20:33:42.5628805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.5629355Z 2025-05-07T20:33:42.5629727Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.5630208Z 2025-05-07T20:33:42.5630513Z moe/activation_test.py:117: 2025-05-07T20:33:42.5631102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.5631777Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.5632330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.5633420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:42.5634646Z return fn(*args, **kwargs) 
2025-05-07T20:33:42.5635983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.5637379Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.5638471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.5639845Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.5641225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.5642259Z kernel = self.compile( 2025-05-07T20:33:42.5643322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.5644546Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.5645328Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.5645763Z 2025-05-07T20:33:42.5646160Z self = 2025-05-07T20:33:42.5648277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.5651055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06fa8f70>} 2025-05-07T20:33:42.5653759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.5655776Z context = 2025-05-07T20:33:42.5656331Z 2025-05-07T20:33:42.5656639Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.5657669Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.5658738Z module_map=module_map) 2025-05-07T20:33:42.5659445Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.5660128Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.5660638Z E ^ 2025-05-07T20:33:42.5661562Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.5662478Z 2025-05-07T20:33:42.5663290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.5664322Z 2025-05-07T20:33:42.5664525Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.5665334Z self=, 2025-05-07T20:33:42.5666113Z T=128, 2025-05-07T20:33:42.5666574Z D=7168, 2025-05-07T20:33:42.5666952Z scale_ub=1200.0, 2025-05-07T20:33:42.5667372Z contiguous=False, 2025-05-07T20:33:42.5667813Z compiled=True, 2025-05-07T20:33:42.5668225Z ) 2025-05-07T20:33:42.6808813Z self = 2025-05-07T20:33:42.6809931Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:42.6810843Z 2025-05-07T20:33:42.6810999Z @given( 2025-05-07T20:33:42.6811550Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.6812135Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.6812657Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.6813280Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.6813927Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.6814501Z ) 2025-05-07T20:33:42.6815184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.6816074Z def test_silu_mul_quant( 2025-05-07T20:33:42.6816545Z self, 2025-05-07T20:33:42.6816913Z T: int, 2025-05-07T20:33:42.6817297Z D: int, 2025-05-07T20:33:42.6817722Z scale_ub: Optional[float], 2025-05-07T20:33:42.6818246Z contiguous: bool, 2025-05-07T20:33:42.6818713Z compiled: bool, 2025-05-07T20:33:42.6819151Z ) -> None: 2025-05-07T20:33:42.6819566Z torch.manual_seed(2025) 2025-05-07T20:33:42.6820051Z 2025-05-07T20:33:42.6820622Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.6821330Z 2025-05-07T20:33:42.6821701Z x_sign = torch.sign(x) 2025-05-07T20:33:42.6822268Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.6822881Z x = x_sign * x_clamp 2025-05-07T20:33:42.6823337Z x0 = x[:, :D] 2025-05-07T20:33:42.6824004Z x1 = x[:, D:] 2025-05-07T20:33:42.6824419Z 2025-05-07T20:33:42.6824768Z if contiguous: 2025-05-07T20:33:42.6825228Z x0 = x0.contiguous() 2025-05-07T20:33:42.6825739Z x1 = x1.contiguous() 2025-05-07T20:33:42.6826201Z 2025-05-07T20:33:42.6826581Z if scale_ub is not None: 2025-05-07T20:33:42.6827120Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.6827771Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.6828380Z ) 2025-05-07T20:33:42.6828760Z else: 2025-05-07T20:33:42.6829164Z scale_ub_tensor = None 2025-05-07T20:33:42.6829662Z 2025-05-07T20:33:42.6830108Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.6830805Z op = silu_mul_quant 2025-05-07T20:33:42.6831295Z if compiled: 2025-05-07T20:33:42.6843067Z op = torch.compile(op) 2025-05-07T20:33:42.6843722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.6844284Z 2025-05-07T20:33:42.6844659Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.6845017Z 2025-05-07T20:33:42.6845216Z moe/activation_test.py:117: 2025-05-07T20:33:42.6850122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.6850964Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.6851539Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.6852691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:42.6853840Z return fn(*args, **kwargs) 
2025-05-07T20:33:42.6855204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.6856629Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.6857667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.6859031Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.6860389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.6861555Z kernel = self.compile( 2025-05-07T20:33:42.6862655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.6863989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.6864919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.6865488Z 2025-05-07T20:33:42.6865895Z self = 2025-05-07T20:33:42.6868097Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.6871024Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06fa92d0>} 2025-05-07T20:33:42.6873957Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.6876088Z context = 2025-05-07T20:33:42.6876694Z 2025-05-07T20:33:42.6877029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.6878090Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.6879033Z module_map=module_map) 2025-05-07T20:33:42.6879707Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.6880268Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.6880717Z E ^ 2025-05-07T20:33:42.6881451Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.6882194Z 2025-05-07T20:33:42.6882866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.6883697Z 2025-05-07T20:33:42.6883889Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.6884593Z self=, 2025-05-07T20:33:42.6885283Z T=2048, 2025-05-07T20:33:42.6885626Z D=7168, 2025-05-07T20:33:42.6885955Z scale_ub=None, 2025-05-07T20:33:42.6886315Z contiguous=True, 2025-05-07T20:33:42.6886726Z compiled=True, 2025-05-07T20:33:42.6887093Z ) 2025-05-07T20:33:42.6887651Z self = 2025-05-07T20:33:42.6888491Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:42.6888954Z 2025-05-07T20:33:42.6889114Z @given( 2025-05-07T20:33:42.6889522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.6890255Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.6890872Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.6891473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.6892029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.6892498Z ) 2025-05-07T20:33:42.6893101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.6893797Z def test_silu_mul_quant( 2025-05-07T20:33:42.6894207Z self, 2025-05-07T20:33:42.6894534Z T: int, 2025-05-07T20:33:42.6894856Z D: int, 2025-05-07T20:33:42.6895222Z scale_ub: Optional[float], 2025-05-07T20:33:42.6895687Z contiguous: bool, 2025-05-07T20:33:42.6896092Z compiled: bool, 2025-05-07T20:33:42.6896481Z ) -> None: 2025-05-07T20:33:42.6896836Z torch.manual_seed(2025) 2025-05-07T20:33:42.6897240Z 2025-05-07T20:33:42.6897680Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.6898245Z 2025-05-07T20:33:42.6898570Z x_sign = torch.sign(x) 2025-05-07T20:33:42.6899043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.6899554Z x = x_sign * x_clamp 2025-05-07T20:33:42.6899954Z x0 = x[:, :D] 2025-05-07T20:33:42.6900395Z x1 = x[:, D:] 2025-05-07T20:33:42.6900789Z 2025-05-07T20:33:42.6901154Z if contiguous: 2025-05-07T20:33:42.6901532Z x0 = x0.contiguous() 2025-05-07T20:33:42.6901962Z x1 = x1.contiguous() 2025-05-07T20:33:42.6902366Z 2025-05-07T20:33:42.6902683Z if scale_ub is not None: 2025-05-07T20:33:42.6903138Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.6903683Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.6904215Z ) 2025-05-07T20:33:42.6904551Z else: 2025-05-07T20:33:42.6904909Z scale_ub_tensor = None 2025-05-07T20:33:42.6905337Z 2025-05-07T20:33:42.6905715Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.6906245Z op = silu_mul_quant 2025-05-07T20:33:42.6906657Z if compiled: 2025-05-07T20:33:42.6907064Z op = torch.compile(op) 2025-05-07T20:33:42.6907557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.6908013Z 2025-05-07T20:33:42.6908323Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.6908605Z 2025-05-07T20:33:42.6908768Z moe/activation_test.py:117: 2025-05-07T20:33:42.6909252Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.6909791Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.6910251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.6911193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:42.6912119Z return fn(*args, **kwargs) 
2025-05-07T20:33:42.6913199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.6914447Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.6915329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.6916446Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.6917553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.6918435Z kernel = self.compile( 2025-05-07T20:33:42.6919326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.6920406Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.6921072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.6921451Z 2025-05-07T20:33:42.6921802Z self = 2025-05-07T20:33:42.6924089Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.6926409Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06faa560>} 2025-05-07T20:33:42.6928643Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.6930330Z context = 2025-05-07T20:33:42.6930855Z 2025-05-07T20:33:42.6931142Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.6932006Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.6932787Z module_map=module_map) 2025-05-07T20:33:42.6933398Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.6934003Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.6934577Z E ^ 2025-05-07T20:33:42.6935436Z E ValueError("type fp8e4nv not supported in this architecture. 
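(Note: every example above failed identically, regardless of T, D, scale_ub, contiguity, or torch.compile. A plausible reading, not confirmed by this log alone: the job ran on a g5.4xlarge, whose NVIDIA A10G is compute capability 8.6, while Triton only lowers fp8e4nv (torch.float8_e4m3fn) on compute capability 8.9 and newer; on SM 8.6 it offers only fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of a capability guard for such tests, assuming a pytest-style suite; the helper and marker names are hypothetical, not part of the test file:

    import pytest
    import torch

    def _has_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) needs SM 8.9+ (Ada/Hopper); the A10G is SM 8.6.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker; apply as @requires_fp8e4nv on test_silu_mul_quant.
    requires_fp8e4nv = pytest.mark.skipif(
        not _has_fp8e4nv(), reason="fp8e4nv requires compute capability >= 8.9"
    )

With such a guard these examples would be reported as skips on this runner instead of repeated CompilationErrors.)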
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.6936202Z 2025-05-07T20:33:42.6936898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.6937740Z 2025-05-07T20:33:42.7794013Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.7794896Z self=, 2025-05-07T20:33:42.7795668Z T=16384, 2025-05-07T20:33:42.7796032Z D=5120, 2025-05-07T20:33:42.7796422Z scale_ub=None, 2025-05-07T20:33:42.7796836Z contiguous=False, 2025-05-07T20:33:42.7797269Z compiled=False, 2025-05-07T20:33:42.7797660Z ) 2025-05-07T20:33:42.7798266Z self = 2025-05-07T20:33:42.7799245Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:42.7799806Z 2025-05-07T20:33:42.7799964Z @given( 2025-05-07T20:33:42.7800407Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.7801001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.7801593Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.7802233Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.7802873Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.7803422Z ) 2025-05-07T20:33:42.7804114Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.7804992Z def test_silu_mul_quant( 2025-05-07T20:33:42.7805458Z self, 2025-05-07T20:33:42.7805831Z T: int, 2025-05-07T20:33:42.7806217Z D: int, 2025-05-07T20:33:42.7806625Z scale_ub: Optional[float], 2025-05-07T20:33:42.7807152Z contiguous: bool, 2025-05-07T20:33:42.7807624Z compiled: bool, 2025-05-07T20:33:42.7808061Z ) -> None: 2025-05-07T20:33:42.7808487Z torch.manual_seed(2025) 2025-05-07T20:33:42.7808967Z 2025-05-07T20:33:42.7809492Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.7810174Z 2025-05-07T20:33:42.7810564Z x_sign = torch.sign(x) 2025-05-07T20:33:42.7811130Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.7815470Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
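[annotation] The repeated CompilationError above is Triton rejecting the fp8e4nv dtype (PyTorch's float8_e4m3fn) while lowering _fbgemm_silu_mul_quant: the GPUs that report only ('fp8e4b15', 'fp8e5') as supported are pre-Ada NVIDIA parts. A minimal sketch of how such a test could be gated, assuming a compute-capability >= 8.9 cutoff; supports_fp8e4nv is a hypothetical helper, not FBGEMM API:

```python
import torch

def supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv (float8_e4m3fn) only on NVIDIA parts
    # with compute capability >= 8.9 (Ada/Hopper); older GPUs report exactly
    # the supported set seen in the error, ('fp8e4b15', 'fp8e5').
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: @unittest.skipUnless(supports_fp8e4nv(), "needs fp8e4nv (sm_89+)")
```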
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:42.7819416Z 2025-05-07T20:33:42.7819651Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:42.7820076Z 2025-05-07T20:33:42.7820275Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.7821143Z self=, 2025-05-07T20:33:42.7821926Z T=4096, 2025-05-07T20:33:42.7822292Z D=7168, 2025-05-07T20:33:42.7822665Z scale_ub=1200.0, 2025-05-07T20:33:42.7823089Z contiguous=True, 2025-05-07T20:33:42.7823518Z compiled=True, 2025-05-07T20:33:42.7824277Z ) 2025-05-07T20:33:42.7824885Z self = 2025-05-07T20:33:42.7825852Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:42.7826376Z 2025-05-07T20:33:42.7826523Z @given( 2025-05-07T20:33:42.7826946Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.7827535Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.7828339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.7829104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.7829730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.7830373Z ) 2025-05-07T20:33:42.7831164Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.7832228Z def test_silu_mul_quant( 2025-05-07T20:33:42.7832737Z self, 2025-05-07T20:33:42.7833108Z T: int, 2025-05-07T20:33:42.7833480Z D: int, 2025-05-07T20:33:42.7833978Z scale_ub: Optional[float], 2025-05-07T20:33:42.7834501Z contiguous: bool, 2025-05-07T20:33:42.7834975Z compiled: bool, 2025-05-07T20:33:42.7835388Z ) -> None: 2025-05-07T20:33:42.7835804Z torch.manual_seed(2025) 2025-05-07T20:33:42.7836274Z 2025-05-07T20:33:42.7836790Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.7837464Z 2025-05-07T20:33:42.7837844Z x_sign = torch.sign(x) 2025-05-07T20:33:42.7838390Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.7842392Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:42.7846147Z 2025-05-07T20:33:42.7846386Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:42.7846803Z 2025-05-07T20:33:42.7847001Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.7847804Z self=, 2025-05-07T20:33:42.7848580Z T=16384, 2025-05-07T20:33:42.7848945Z D=7168, 2025-05-07T20:33:42.7849342Z scale_ub=None, 2025-05-07T20:33:42.7849742Z contiguous=False, 2025-05-07T20:33:42.7850178Z compiled=False, 2025-05-07T20:33:42.7850624Z ) 2025-05-07T20:33:42.7851227Z self = 2025-05-07T20:33:42.7852183Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:42.7852721Z 2025-05-07T20:33:42.7852882Z @given( 2025-05-07T20:33:42.7853315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.7853912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.7854668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.7855409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.7856043Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.7856604Z ) 2025-05-07T20:33:42.7857299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.7858168Z def test_silu_mul_quant( 2025-05-07T20:33:42.7858640Z self, 2025-05-07T20:33:42.7859001Z T: int, 2025-05-07T20:33:42.7859374Z D: int, 2025-05-07T20:33:42.7859782Z scale_ub: Optional[float], 2025-05-07T20:33:42.7860305Z contiguous: bool, 2025-05-07T20:33:42.7860801Z compiled: bool, 2025-05-07T20:33:42.7861231Z ) -> None: 2025-05-07T20:33:42.7861637Z torch.manual_seed(2025) 2025-05-07T20:33:42.7862101Z 2025-05-07T20:33:42.7862614Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.7866833Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
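[annotation] The sizes the allocator reports correspond exactly to one bfloat16 tensor of shape [T, 2 * D] (2 bytes per element), i.e. the x produced by torch.randn in the test body. A quick check of the figures seen in these failures:

```python
def bf16_tensor_mib(T: int, D: int) -> float:
    # Size in MiB of one [T, 2*D] bfloat16 tensor: T * 2D elements * 2 bytes.
    return T * (2 * D) * 2 / 2**20

assert bf16_tensor_mib(16384, 7168) == 448.0  # the 448.00 MiB request above
assert bf16_tensor_mib(16384, 5120) == 320.0  # the 320.00 MiB request
assert bf16_tensor_mib(4096, 7168) == 112.0   # the 112.00 MiB request
assert bf16_tensor_mib(2048, 7168) == 56.0    # the 56.00 MiB request
```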
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:42.7870879Z 2025-05-07T20:33:42.7871170Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:42.7871687Z 2025-05-07T20:33:42.7871900Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.7872676Z self=, 2025-05-07T20:33:42.7873462Z T=2048, 2025-05-07T20:33:42.7873915Z D=7168, 2025-05-07T20:33:42.7874274Z scale_ub=1200.0, 2025-05-07T20:33:42.7874711Z contiguous=True, 2025-05-07T20:33:42.7875138Z compiled=True, 2025-05-07T20:33:42.7875538Z ) 2025-05-07T20:33:42.7876144Z self = 2025-05-07T20:33:42.7877099Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:42.7877631Z 2025-05-07T20:33:42.7877787Z @given( 2025-05-07T20:33:42.7878217Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.7878828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.7879423Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.7880052Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.7880715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.7881296Z ) 2025-05-07T20:33:42.7881968Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.7882833Z def test_silu_mul_quant( 2025-05-07T20:33:42.7883293Z self, 2025-05-07T20:33:42.7883669Z T: int, 2025-05-07T20:33:42.7884040Z D: int, 2025-05-07T20:33:42.7884448Z scale_ub: Optional[float], 2025-05-07T20:33:42.7884972Z contiguous: bool, 2025-05-07T20:33:42.7885428Z compiled: bool, 2025-05-07T20:33:42.7885858Z ) -> None: 2025-05-07T20:33:42.7886277Z torch.manual_seed(2025) 2025-05-07T20:33:42.7886757Z 2025-05-07T20:33:42.7887293Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.7887933Z 2025-05-07T20:33:42.7888268Z x_sign = torch.sign(x) 2025-05-07T20:33:42.7888815Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.7893512Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:42.7897397Z 2025-05-07T20:33:42.7897632Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:42.7898059Z 2025-05-07T20:33:42.7898267Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.7899065Z self=, 2025-05-07T20:33:42.7899860Z T=2048, 2025-05-07T20:33:42.7900220Z D=7168, 2025-05-07T20:33:42.7900594Z scale_ub=None, 2025-05-07T20:33:42.7901040Z contiguous=True, 2025-05-07T20:33:42.7901468Z compiled=False, 2025-05-07T20:33:42.7901855Z ) 2025-05-07T20:33:43.1073406Z self = 2025-05-07T20:33:43.1074549Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.1075099Z 2025-05-07T20:33:43.1075264Z @given( 2025-05-07T20:33:43.1075691Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.1076296Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.1076874Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.1078023Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.1078628Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.1079139Z ) 2025-05-07T20:33:43.1079782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.1080633Z def test_silu_mul_quant( 2025-05-07T20:33:43.1081087Z self, 2025-05-07T20:33:43.1081451Z T: int, 2025-05-07T20:33:43.1081813Z D: int, 2025-05-07T20:33:43.1082223Z scale_ub: Optional[float], 2025-05-07T20:33:43.1082759Z contiguous: bool, 2025-05-07T20:33:43.1083208Z compiled: bool, 2025-05-07T20:33:43.1083638Z ) -> None: 2025-05-07T20:33:43.1084041Z torch.manual_seed(2025) 2025-05-07T20:33:43.1084507Z 2025-05-07T20:33:43.1085015Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.1085668Z 2025-05-07T20:33:43.1086035Z > x_sign = torch.sign(x) 2025-05-07T20:33:43.1089933Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.1093762Z 2025-05-07T20:33:43.1094006Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:43.1094429Z 2025-05-07T20:33:43.1094637Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.1095444Z self=, 2025-05-07T20:33:43.1096241Z T=1, 2025-05-07T20:33:43.1096584Z D=7168, 2025-05-07T20:33:43.1096954Z scale_ub=1200.0, 2025-05-07T20:33:43.1097380Z contiguous=True, 2025-05-07T20:33:43.1097809Z compiled=False, 2025-05-07T20:33:43.1098193Z ) 2025-05-07T20:33:43.1098799Z self = 2025-05-07T20:33:43.1099740Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:43.1100256Z 2025-05-07T20:33:43.1100404Z @given( 2025-05-07T20:33:43.1100840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.1101462Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.1102060Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.1102896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.1103659Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.1104223Z ) 2025-05-07T20:33:43.1104918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.1105791Z def test_silu_mul_quant( 2025-05-07T20:33:43.1106273Z self, 2025-05-07T20:33:43.1106644Z T: int, 2025-05-07T20:33:43.1107027Z D: int, 2025-05-07T20:33:43.1107443Z scale_ub: Optional[float], 2025-05-07T20:33:43.1107953Z contiguous: bool, 2025-05-07T20:33:43.1108418Z compiled: bool, 2025-05-07T20:33:43.1108833Z ) -> None: 2025-05-07T20:33:43.1109226Z torch.manual_seed(2025) 2025-05-07T20:33:43.1109675Z 2025-05-07T20:33:43.1110158Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.1110867Z 2025-05-07T20:33:43.1111225Z x_sign = torch.sign(x) 2025-05-07T20:33:43.1111768Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.1112371Z x = x_sign * x_clamp 2025-05-07T20:33:43.1112838Z x0 = x[:, :D] 2025-05-07T20:33:43.1113237Z x1 = x[:, D:] 2025-05-07T20:33:43.1113745Z 2025-05-07T20:33:43.1114097Z if contiguous: 2025-05-07T20:33:43.1114532Z x0 = x0.contiguous() 2025-05-07T20:33:43.1115134Z x1 = x1.contiguous() 2025-05-07T20:33:43.1115704Z 2025-05-07T20:33:43.1116089Z if scale_ub is not None: 2025-05-07T20:33:43.1116611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:43.1117239Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:43.1117839Z ) 2025-05-07T20:33:43.1118213Z else: 2025-05-07T20:33:43.1118604Z scale_ub_tensor = None 2025-05-07T20:33:43.1119063Z 2025-05-07T20:33:43.1119477Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:43.1120052Z op = silu_mul_quant 2025-05-07T20:33:43.1120543Z if compiled: 2025-05-07T20:33:43.1121021Z op = torch.compile(op) 2025-05-07T20:33:43.1121583Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.1122094Z 2025-05-07T20:33:43.1122453Z > y_fp8, y_scale = fn() 2025-05-07T20:33:43.1122773Z 2025-05-07T20:33:43.1122964Z moe/activation_test.py:117: 2025-05-07T20:33:43.1123546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.1124423Z moe/activation_test.py:115: in fn 2025-05-07T20:33:43.1124973Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.1126319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:43.1127672Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:43.1128735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:43.1130042Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:43.1131354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:43.1132409Z kernel = self.compile( 2025-05-07T20:33:43.1133471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:43.1134748Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:43.1135529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.1135990Z 2025-05-07T20:33:43.1136365Z self = 2025-05-07T20:33:43.1138480Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:43.1141580Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06b4c4c0>} 2025-05-07T20:33:43.1144277Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:43.1146298Z context = 2025-05-07T20:33:43.1146867Z 2025-05-07T20:33:43.1147200Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:43.1148224Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:43.1149136Z module_map=module_map) 2025-05-07T20:33:43.1149834Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:43.1150519Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:43.1151022Z E ^ 2025-05-07T20:33:43.1151929Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:43.1152820Z 2025-05-07T20:33:43.1153759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:43.1154921Z 2025-05-07T20:33:43.1155230Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.1156038Z self=, 2025-05-07T20:33:43.1167061Z T=128, 2025-05-07T20:33:43.1167479Z D=5120, 2025-05-07T20:33:43.1167838Z scale_ub=None, 2025-05-07T20:33:43.1168256Z contiguous=True, 2025-05-07T20:33:43.1168688Z compiled=False, 2025-05-07T20:33:43.1169077Z ) 2025-05-07T20:33:43.2006074Z self = 2025-05-07T20:33:43.2007089Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.2007606Z 2025-05-07T20:33:43.2007746Z @given( 2025-05-07T20:33:43.2008167Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.2008728Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.2009304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.2009938Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.2010578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.2011132Z ) 2025-05-07T20:33:43.2011801Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.2012666Z def test_silu_mul_quant( 2025-05-07T20:33:43.2013116Z self, 2025-05-07T20:33:43.2013487Z T: int, 2025-05-07T20:33:43.2013860Z D: int, 2025-05-07T20:33:43.2014261Z scale_ub: Optional[float], 2025-05-07T20:33:43.2014774Z contiguous: bool, 2025-05-07T20:33:43.2015224Z compiled: bool, 2025-05-07T20:33:43.2015641Z ) -> None: 2025-05-07T20:33:43.2016049Z torch.manual_seed(2025) 2025-05-07T20:33:43.2016515Z 2025-05-07T20:33:43.2017030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.2017700Z 2025-05-07T20:33:43.2018071Z x_sign = torch.sign(x) 2025-05-07T20:33:43.2018621Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.2019224Z x = x_sign * x_clamp 2025-05-07T20:33:43.2019683Z x0 = x[:, :D] 2025-05-07T20:33:43.2020098Z x1 = x[:, D:] 2025-05-07T20:33:43.2020494Z 2025-05-07T20:33:43.2020854Z if contiguous: 2025-05-07T20:33:43.2021301Z x0 = x0.contiguous() 2025-05-07T20:33:43.2021795Z x1 = x1.contiguous() 2025-05-07T20:33:43.2022263Z 2025-05-07T20:33:43.2022635Z if scale_ub is not None: 2025-05-07T20:33:43.2023162Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:43.2024060Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:43.2025110Z ) 2025-05-07T20:33:43.2025480Z else: 2025-05-07T20:33:43.2026022Z scale_ub_tensor = None 2025-05-07T20:33:43.2026521Z 2025-05-07T20:33:43.2026959Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:43.2027556Z op = silu_mul_quant 2025-05-07T20:33:43.2028022Z if compiled: 2025-05-07T20:33:43.2028482Z op = torch.compile(op) 2025-05-07T20:33:43.2029046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.2029559Z 2025-05-07T20:33:43.2029931Z > y_fp8, y_scale = fn() 2025-05-07T20:33:43.2030261Z 2025-05-07T20:33:43.2030448Z moe/activation_test.py:117: 2025-05-07T20:33:43.2031012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.2031660Z moe/activation_test.py:115: in fn 2025-05-07T20:33:43.2032209Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.2033653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:43.2035054Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:43.2036109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:43.2037468Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:43.2039048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:43.2040052Z kernel = self.compile( 2025-05-07T20:33:43.2041090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:43.2042371Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:43.2043127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.2043566Z 2025-05-07T20:33:43.2043934Z self = 2025-05-07T20:33:43.2046044Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:43.2048771Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06b4c940>} 2025-05-07T20:33:43.2051408Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:43.2053401Z context = 2025-05-07T20:33:43.2053970Z 2025-05-07T20:33:43.2054280Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:43.2055293Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:43.2056205Z module_map=module_map) 2025-05-07T20:33:43.2056886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:43.2057537Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:43.2058039Z E ^ 2025-05-07T20:33:43.2058932Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:43.2059828Z 2025-05-07T20:33:43.2060643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:43.2061655Z 2025-05-07T20:33:43.2061850Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.2062637Z self=, 2025-05-07T20:33:43.2063406Z T=128, 2025-05-07T20:33:43.2063767Z D=7168, 2025-05-07T20:33:43.2064127Z scale_ub=None, 2025-05-07T20:33:43.2064640Z contiguous=True, 2025-05-07T20:33:43.2065064Z compiled=False, 2025-05-07T20:33:43.2065527Z ) 2025-05-07T20:33:43.2066126Z self = 2025-05-07T20:33:43.2067062Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.2067581Z 2025-05-07T20:33:43.2067738Z @given( 2025-05-07T20:33:43.2068178Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.2068760Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.2069342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.2069980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.2070608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.2071144Z ) 2025-05-07T20:33:43.2071810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.2072651Z def test_silu_mul_quant( 2025-05-07T20:33:43.2073121Z self, 2025-05-07T20:33:43.2073488Z T: int, 2025-05-07T20:33:43.2073962Z D: int, 2025-05-07T20:33:43.2074374Z scale_ub: Optional[float], 2025-05-07T20:33:43.2074889Z contiguous: bool, 2025-05-07T20:33:43.2075334Z compiled: bool, 2025-05-07T20:33:43.2075753Z ) -> None: 2025-05-07T20:33:43.2076252Z torch.manual_seed(2025) 2025-05-07T20:33:43.2076768Z 2025-05-07T20:33:43.2077280Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.2077942Z 2025-05-07T20:33:43.2078311Z x_sign = torch.sign(x) 2025-05-07T20:33:43.2078845Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.2079441Z x = x_sign * x_clamp 2025-05-07T20:33:43.2079903Z x0 = x[:, :D] 2025-05-07T20:33:43.2080304Z x1 = x[:, D:] 2025-05-07T20:33:43.2080704Z 2025-05-07T20:33:43.2081065Z if contiguous: 2025-05-07T20:33:43.2081517Z x0 = x0.contiguous() 2025-05-07T20:33:43.2082036Z x1 = x1.contiguous() 2025-05-07T20:33:43.2082500Z 2025-05-07T20:33:43.2082862Z if scale_ub is not None: 2025-05-07T20:33:43.2083387Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:43.2084016Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:43.2084605Z ) 2025-05-07T20:33:43.2084973Z else: 2025-05-07T20:33:43.2085378Z scale_ub_tensor = None 2025-05-07T20:33:43.2085857Z 2025-05-07T20:33:43.2086302Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:43.2086913Z op = silu_mul_quant 2025-05-07T20:33:43.2087395Z if compiled: 2025-05-07T20:33:43.2087866Z op = torch.compile(op) 2025-05-07T20:33:43.2088430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.2088961Z 2025-05-07T20:33:43.2089318Z > y_fp8, y_scale = fn() 2025-05-07T20:33:43.2089640Z 2025-05-07T20:33:43.2089826Z moe/activation_test.py:117: 2025-05-07T20:33:43.2090426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.2091124Z moe/activation_test.py:115: in fn 2025-05-07T20:33:43.2091674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.2093042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:43.2094427Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:43.2095449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:43.2096780Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:43.2098106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:43.2099167Z kernel = self.compile( 2025-05-07T20:33:43.2100178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:43.2101668Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:43.2102456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.2102904Z 2025-05-07T20:33:43.2103298Z self = 2025-05-07T20:33:43.2105462Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:43.2108243Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06b4d240>} 2025-05-07T20:33:43.2110967Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:43.2113030Z context = 2025-05-07T20:33:43.2113681Z 2025-05-07T20:33:43.2114005Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:43.2115184Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:43.2116122Z module_map=module_map) 2025-05-07T20:33:43.2116820Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:43.2117503Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:43.2118009Z E ^ 2025-05-07T20:33:43.2118932Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:43.2119838Z 2025-05-07T20:33:43.2120694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:43.2121725Z 2025-05-07T20:33:43.2121928Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.2122732Z self=, 2025-05-07T20:33:43.2123491Z T=2048, 2025-05-07T20:33:43.2124585Z D=7168, 2025-05-07T20:33:43.2124973Z scale_ub=1200.0, 2025-05-07T20:33:43.2125393Z contiguous=True, 2025-05-07T20:33:43.2125812Z compiled=False, 2025-05-07T20:33:43.2126199Z ) 2025-05-07T20:33:43.3166256Z self = 2025-05-07T20:33:43.3167278Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:43.3167802Z 2025-05-07T20:33:43.3167955Z @given( 2025-05-07T20:33:43.3168348Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.3168888Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.3169445Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.3170077Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.3170735Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.3171312Z ) 2025-05-07T20:33:43.3171965Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.3172825Z def test_silu_mul_quant( 2025-05-07T20:33:43.3173317Z self, 2025-05-07T20:33:43.3173724Z T: int, 2025-05-07T20:33:43.3174106Z D: int, 2025-05-07T20:33:43.3174527Z scale_ub: Optional[float], 2025-05-07T20:33:43.3175042Z contiguous: bool, 2025-05-07T20:33:43.3175510Z compiled: bool, 2025-05-07T20:33:43.3175939Z ) -> None: 2025-05-07T20:33:43.3176342Z torch.manual_seed(2025) 2025-05-07T20:33:43.3176814Z 2025-05-07T20:33:43.3177331Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.3181775Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
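[annotation] For reference, the operation under test composes SiLU, an elementwise product, and an fp8 quantization. The log does not show the kernel's scaling scheme, so the following is only a sketch of the presumed math, assuming row-wise dynamic float8_e4m3fn quantization (the dtype matching Triton's fp8e4nv) with an optional scale upper bound; silu_mul_quant_ref and eps are illustrative names, not the FBGEMM implementation:

```python
import torch

def silu_mul_quant_ref(x0, x1, scale_ub=None, eps=1e-12):
    # Sketch: y = SiLU(x0) * x1 in fp32, then scale each row into the
    # representable range of float8_e4m3fn (max magnitude 448.0).
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=eps)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    y_scale = row_max / fp8_max
    y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```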
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.3185763Z 2025-05-07T20:33:43.3186009Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.3186426Z 2025-05-07T20:33:43.3186628Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.3187445Z self=, 2025-05-07T20:33:43.3188232Z T=1, 2025-05-07T20:33:43.3188581Z D=5120, 2025-05-07T20:33:43.3188953Z scale_ub=1200.0, 2025-05-07T20:33:43.3189371Z contiguous=True, 2025-05-07T20:33:43.3189794Z compiled=False, 2025-05-07T20:33:43.3190188Z ) 2025-05-07T20:33:43.3190808Z self = 2025-05-07T20:33:43.3191765Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:43.3192295Z 2025-05-07T20:33:43.3192445Z @given( 2025-05-07T20:33:43.3192895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.3193924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.3194523Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.3195182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.3195845Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.3196410Z ) 2025-05-07T20:33:43.3197103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.3197999Z def test_silu_mul_quant( 2025-05-07T20:33:43.3198474Z self, 2025-05-07T20:33:43.3198842Z T: int, 2025-05-07T20:33:43.3199229Z D: int, 2025-05-07T20:33:43.3199654Z scale_ub: Optional[float], 2025-05-07T20:33:43.3200170Z contiguous: bool, 2025-05-07T20:33:43.3200629Z compiled: bool, 2025-05-07T20:33:43.3201049Z ) -> None: 2025-05-07T20:33:43.3201448Z torch.manual_seed(2025) 2025-05-07T20:33:43.3201908Z 2025-05-07T20:33:43.3202433Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.3203106Z 2025-05-07T20:33:43.3203481Z x_sign = torch.sign(x) 2025-05-07T20:33:43.3204011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.3204592Z x = x_sign * x_clamp 2025-05-07T20:33:43.3205057Z x0 = x[:, :D] 2025-05-07T20:33:43.3205470Z x1 = x[:, D:] 2025-05-07T20:33:43.3205861Z 2025-05-07T20:33:43.3206222Z if contiguous: 2025-05-07T20:33:43.3206666Z x0 = x0.contiguous() 2025-05-07T20:33:43.3207169Z x1 = x1.contiguous() 2025-05-07T20:33:43.3207628Z 2025-05-07T20:33:43.3208009Z if scale_ub is not None: 2025-05-07T20:33:43.3208532Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:43.3209172Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:43.3209779Z ) 2025-05-07T20:33:43.3210157Z else: 2025-05-07T20:33:43.3210572Z scale_ub_tensor = None 2025-05-07T20:33:43.3211100Z 2025-05-07T20:33:43.3211545Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:43.3212132Z op = silu_mul_quant 2025-05-07T20:33:43.3212615Z if compiled: 2025-05-07T20:33:43.3213093Z op = torch.compile(op) 2025-05-07T20:33:43.3213652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.3214193Z 2025-05-07T20:33:43.3214561Z > y_fp8, y_scale = fn() 2025-05-07T20:33:43.3214878Z 2025-05-07T20:33:43.3215078Z moe/activation_test.py:117: 2025-05-07T20:33:43.3215638Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.3216407Z moe/activation_test.py:115: in fn 2025-05-07T20:33:43.3217030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.3218398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:43.3219750Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:43.3220866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:43.3222218Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:43.3223526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:43.3224924Z kernel = self.compile( 2025-05-07T20:33:43.3225985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:43.3227271Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:43.3228058Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.3228519Z 2025-05-07T20:33:43.3228909Z self = 2025-05-07T20:33:43.3231193Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:43.3234122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06b4e200>} 2025-05-07T20:33:43.3236842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:43.3238865Z context = 2025-05-07T20:33:43.3239445Z 2025-05-07T20:33:43.3239783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:43.3240797Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:43.3241716Z module_map=module_map) 2025-05-07T20:33:43.3242431Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:43.3243103Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:43.3243600Z E ^ 2025-05-07T20:33:43.3244517Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:43.3245420Z 2025-05-07T20:33:43.3246255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:43.3247287Z 2025-05-07T20:33:43.3247496Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.3248310Z self=, 2025-05-07T20:33:43.3249111Z T=2048, 2025-05-07T20:33:43.3249476Z D=5120, 2025-05-07T20:33:43.3249845Z scale_ub=None, 2025-05-07T20:33:43.3250260Z contiguous=True, 2025-05-07T20:33:43.3250704Z compiled=False, 2025-05-07T20:33:43.3251120Z ) 2025-05-07T20:33:43.3251748Z self = 2025-05-07T20:33:43.3252714Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.3253246Z 2025-05-07T20:33:43.3253407Z @given( 2025-05-07T20:33:43.3253837Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.3254444Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.3255043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.3255665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.3256302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.3257018Z ) 2025-05-07T20:33:43.3257786Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.3258677Z def test_silu_mul_quant( 2025-05-07T20:33:43.3259150Z self, 2025-05-07T20:33:43.3259511Z T: int, 2025-05-07T20:33:43.3259894Z D: int, 2025-05-07T20:33:43.3260321Z scale_ub: Optional[float], 2025-05-07T20:33:43.3260833Z contiguous: bool, 2025-05-07T20:33:43.3261283Z compiled: bool, 2025-05-07T20:33:43.3261697Z ) -> None: 2025-05-07T20:33:43.3262092Z torch.manual_seed(2025) 2025-05-07T20:33:43.3262558Z 2025-05-07T20:33:43.3263077Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.3263746Z 2025-05-07T20:33:43.3264106Z > x_sign = torch.sign(x) 2025-05-07T20:33:43.3268145Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.3272088Z 2025-05-07T20:33:43.3272322Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:43.3272743Z 2025-05-07T20:33:43.3272954Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.3273831Z self=, 2025-05-07T20:33:43.3274637Z T=16384, 2025-05-07T20:33:43.3275014Z D=5120, 2025-05-07T20:33:43.3275386Z scale_ub=None, 2025-05-07T20:33:43.3275790Z contiguous=True, 2025-05-07T20:33:43.3276224Z compiled=False, 2025-05-07T20:33:43.3276624Z ) 2025-05-07T20:33:43.4325536Z self = 2025-05-07T20:33:43.4326587Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.4327110Z 2025-05-07T20:33:43.4327260Z @given( 2025-05-07T20:33:43.4327681Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.4328265Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.4328833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.4329464Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.4330084Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.4330672Z ) 2025-05-07T20:33:43.4331359Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.4332223Z def test_silu_mul_quant( 2025-05-07T20:33:43.4332701Z self, 2025-05-07T20:33:43.4333060Z T: int, 2025-05-07T20:33:43.4333438Z D: int, 2025-05-07T20:33:43.4333852Z scale_ub: Optional[float], 2025-05-07T20:33:43.4334376Z contiguous: bool, 2025-05-07T20:33:43.4334841Z compiled: bool, 2025-05-07T20:33:43.4335265Z ) -> None: 2025-05-07T20:33:43.4335672Z torch.manual_seed(2025) 2025-05-07T20:33:43.4336130Z 2025-05-07T20:33:43.4336642Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.4340781Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
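[annotation] The "allocated by PyTorch" figure stays pinned near 21.7 GiB across consecutive examples, so even small requests (40-80 MiB) fail once the device is saturated. A hedged mitigation sketch that frees cached CUDA memory between Hypothesis examples; _free_cuda is a hypothetical helper, and explicit empty_cache calls trade test speed for headroom:

```python
import gc

import torch

def _free_cuda() -> None:
    # Drop dangling Python references, then return cached allocator blocks
    # to the driver so the next example starts from a cleaner state.
    gc.collect()
    torch.cuda.empty_cache()

# Usage sketch: call _free_cuda() from the test case's tearDown(), or at the
# top of test_silu_mul_quant before the torch.randn allocation.
```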
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.4344635Z 2025-05-07T20:33:43.4345162Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.4345602Z 2025-05-07T20:33:43.4345919Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.4346724Z self=, 2025-05-07T20:33:43.4347515Z T=4096, 2025-05-07T20:33:43.4347865Z D=5120, 2025-05-07T20:33:43.4348240Z scale_ub=None, 2025-05-07T20:33:43.4348650Z contiguous=True, 2025-05-07T20:33:43.4349072Z compiled=False, 2025-05-07T20:33:43.4349470Z ) 2025-05-07T20:33:43.4350089Z self = 2025-05-07T20:33:43.4351045Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.4351601Z 2025-05-07T20:33:43.4351751Z @given( 2025-05-07T20:33:43.4352188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.4352801Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.4353391Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.4354202Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.4354858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.4355410Z ) 2025-05-07T20:33:43.4356101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.4357124Z def test_silu_mul_quant( 2025-05-07T20:33:43.4357584Z self, 2025-05-07T20:33:43.4358092Z T: int, 2025-05-07T20:33:43.4358475Z D: int, 2025-05-07T20:33:43.4358863Z scale_ub: Optional[float], 2025-05-07T20:33:43.4359366Z contiguous: bool, 2025-05-07T20:33:43.4359811Z compiled: bool, 2025-05-07T20:33:43.4360222Z ) -> None: 2025-05-07T20:33:43.4360659Z torch.manual_seed(2025) 2025-05-07T20:33:43.4361143Z 2025-05-07T20:33:43.4361651Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.4365699Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.4369369Z 2025-05-07T20:33:43.4369604Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.4370029Z 2025-05-07T20:33:43.4370220Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.4382517Z self=, 2025-05-07T20:33:43.4383330Z T=2048, 2025-05-07T20:33:43.4383696Z D=5120, 2025-05-07T20:33:43.4384044Z scale_ub=None, 2025-05-07T20:33:43.4384456Z contiguous=False, 2025-05-07T20:33:43.4384904Z compiled=False, 2025-05-07T20:33:43.4385290Z ) 2025-05-07T20:33:43.4385905Z self = 2025-05-07T20:33:43.4386881Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:43.4387418Z 2025-05-07T20:33:43.4387565Z @given( 2025-05-07T20:33:43.4388011Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.4388619Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.4389214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.4389850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.4390505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.4391084Z ) 2025-05-07T20:33:43.4391742Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.4392601Z def test_silu_mul_quant( 2025-05-07T20:33:43.4393071Z self, 2025-05-07T20:33:43.4393432Z T: int, 2025-05-07T20:33:43.4394034Z D: int, 2025-05-07T20:33:43.4394448Z scale_ub: Optional[float], 2025-05-07T20:33:43.4395033Z contiguous: bool, 2025-05-07T20:33:43.4395496Z compiled: bool, 2025-05-07T20:33:43.4395924Z ) -> None: 2025-05-07T20:33:43.4396322Z torch.manual_seed(2025) 2025-05-07T20:33:43.4396788Z 2025-05-07T20:33:43.4397310Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.4401385Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
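[annotation] The @given strategies above all use st.sampled_from over fixed lists, so the search space that verbose mode cycles through is a small Cartesian product:

```python
# 5 values of T, 2 of D, and 2 each for scale_ub / contiguous / compiled:
n_cases = 5 * 2 * 2 * 2 * 2
print(n_cases)  # 80 distinct examples, capped by _MAX_SAMPLES in @settings
```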
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.4405121Z 2025-05-07T20:33:43.4405377Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.4405798Z 2025-05-07T20:33:43.4405999Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.4406803Z self=, 2025-05-07T20:33:43.4407703Z T=4096, 2025-05-07T20:33:43.4408058Z D=7168, 2025-05-07T20:33:43.4408500Z scale_ub=None, 2025-05-07T20:33:43.4408925Z contiguous=True, 2025-05-07T20:33:43.4409344Z compiled=True, 2025-05-07T20:33:43.4409743Z ) 2025-05-07T20:33:43.4410355Z self = 2025-05-07T20:33:43.4411377Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:43.4411905Z 2025-05-07T20:33:43.4412055Z @given( 2025-05-07T20:33:43.4412502Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.4413100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.4413695Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.4414332Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.4414949Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.4415492Z ) 2025-05-07T20:33:43.4416157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.4417038Z def test_silu_mul_quant( 2025-05-07T20:33:43.4417508Z self, 2025-05-07T20:33:43.4417878Z T: int, 2025-05-07T20:33:43.4418253Z D: int, 2025-05-07T20:33:43.4418672Z scale_ub: Optional[float], 2025-05-07T20:33:43.4419179Z contiguous: bool, 2025-05-07T20:33:43.4419632Z compiled: bool, 2025-05-07T20:33:43.4420027Z ) -> None: 2025-05-07T20:33:43.4420436Z torch.manual_seed(2025) 2025-05-07T20:33:43.4420946Z 2025-05-07T20:33:43.4421454Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.4425836Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.4429654Z 2025-05-07T20:33:43.4429885Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.4430304Z 2025-05-07T20:33:43.4430507Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.4431305Z self=, 2025-05-07T20:33:43.4432095Z T=2048, 2025-05-07T20:33:43.4432457Z D=5120, 2025-05-07T20:33:43.4432830Z scale_ub=1200.0, 2025-05-07T20:33:43.4433422Z contiguous=False, 2025-05-07T20:33:43.4433940Z compiled=False, 2025-05-07T20:33:43.4434435Z ) 2025-05-07T20:33:43.4435046Z self = 2025-05-07T20:33:43.4436015Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:43.4436561Z 2025-05-07T20:33:43.4436715Z @given( 2025-05-07T20:33:43.4437145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.4437748Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.4438340Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.4438971Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.4439609Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.4440167Z ) 2025-05-07T20:33:43.4440850Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.4441706Z def test_silu_mul_quant( 2025-05-07T20:33:43.4442180Z self, 2025-05-07T20:33:43.4442548Z T: int, 2025-05-07T20:33:43.4442919Z D: int, 2025-05-07T20:33:43.4443332Z scale_ub: Optional[float], 2025-05-07T20:33:43.4443826Z contiguous: bool, 2025-05-07T20:33:43.4444274Z compiled: bool, 2025-05-07T20:33:43.4444695Z ) -> None: 2025-05-07T20:33:43.4445231Z torch.manual_seed(2025) 2025-05-07T20:33:43.4445775Z 2025-05-07T20:33:43.4446290Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.4450300Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.4454193Z 2025-05-07T20:33:43.4454423Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.4454843Z 2025-05-07T20:33:43.4455051Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.4455857Z self=, 2025-05-07T20:33:43.4456655Z T=4096, 2025-05-07T20:33:43.4457016Z D=7168, 2025-05-07T20:33:43.4457376Z scale_ub=1200.0, 2025-05-07T20:33:43.4457807Z contiguous=True, 2025-05-07T20:33:43.4458235Z compiled=False, 2025-05-07T20:33:43.4458620Z ) 2025-05-07T20:33:43.5824563Z self = 2025-05-07T20:33:43.5825594Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:43.5826122Z 2025-05-07T20:33:43.5826264Z @given( 2025-05-07T20:33:43.5826705Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.5827304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.5827885Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.5828493Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.5829067Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.5829563Z ) 2025-05-07T20:33:43.5830178Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.5831031Z def test_silu_mul_quant( 2025-05-07T20:33:43.5831448Z self, 2025-05-07T20:33:43.5831801Z T: int, 2025-05-07T20:33:43.5832173Z D: int, 2025-05-07T20:33:43.5832565Z scale_ub: Optional[float], 2025-05-07T20:33:43.5833087Z contiguous: bool, 2025-05-07T20:33:43.5833646Z compiled: bool, 2025-05-07T20:33:43.5834072Z ) -> None: 2025-05-07T20:33:43.5834482Z torch.manual_seed(2025) 2025-05-07T20:33:43.5834939Z 2025-05-07T20:33:43.5835443Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.5839979Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.5843859Z 2025-05-07T20:33:43.5844089Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.5844507Z 2025-05-07T20:33:43.5844704Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.5845501Z self=, 2025-05-07T20:33:43.5846275Z T=16384, 2025-05-07T20:33:43.5846651Z D=7168, 2025-05-07T20:33:43.5847018Z scale_ub=None, 2025-05-07T20:33:43.5847424Z contiguous=False, 2025-05-07T20:33:43.5847859Z compiled=True, 2025-05-07T20:33:43.5848254Z ) 2025-05-07T20:33:43.5848858Z self = 2025-05-07T20:33:43.5849973Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:43.5850641Z 2025-05-07T20:33:43.5850801Z @given( 2025-05-07T20:33:43.5851248Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.5851846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.5852432Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.5853075Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.5853704Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.5854259Z ) 2025-05-07T20:33:43.5854935Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.5855808Z def test_silu_mul_quant( 2025-05-07T20:33:43.5856265Z self, 2025-05-07T20:33:43.5856640Z T: int, 2025-05-07T20:33:43.5857016Z D: int, 2025-05-07T20:33:43.5857415Z scale_ub: Optional[float], 2025-05-07T20:33:43.5857936Z contiguous: bool, 2025-05-07T20:33:43.5858402Z compiled: bool, 2025-05-07T20:33:43.5858815Z ) -> None: 2025-05-07T20:33:43.5859210Z torch.manual_seed(2025) 2025-05-07T20:33:43.5859620Z 2025-05-07T20:33:43.5860116Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.5864203Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.5867899Z 2025-05-07T20:33:43.5868135Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.5868560Z 2025-05-07T20:33:43.5868769Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.5869571Z self=, 2025-05-07T20:33:43.5870338Z T=4096, 2025-05-07T20:33:43.5870706Z D=7168, 2025-05-07T20:33:43.5871070Z scale_ub=None, 2025-05-07T20:33:43.5871476Z contiguous=True, 2025-05-07T20:33:43.5871908Z compiled=False, 2025-05-07T20:33:43.5872301Z ) 2025-05-07T20:33:43.5872893Z self = 2025-05-07T20:33:43.5873991Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.5874515Z 2025-05-07T20:33:43.5874791Z @given( 2025-05-07T20:33:43.5875223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.5876835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.5877459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.5878100Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.5878746Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.5879329Z ) 2025-05-07T20:33:43.5880020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.5880831Z def test_silu_mul_quant( 2025-05-07T20:33:43.5881266Z self, 2025-05-07T20:33:43.5881629Z T: int, 2025-05-07T20:33:43.5881989Z D: int, 2025-05-07T20:33:43.5882400Z scale_ub: Optional[float], 2025-05-07T20:33:43.5882928Z contiguous: bool, 2025-05-07T20:33:43.5883383Z compiled: bool, 2025-05-07T20:33:43.5883818Z ) -> None: 2025-05-07T20:33:43.5884237Z torch.manual_seed(2025) 2025-05-07T20:33:43.5884702Z 2025-05-07T20:33:43.5885231Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.5889419Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
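[annotation] A defensive variant of the test could also pre-check free device memory before materializing the inputs. A sketch under the assumption that the body needs a few multiples of the base [T, 2*D] bf16 tensor (x, x_sign, x_clamp, their product, and the slices); fits_on_gpu and slack are illustrative names:

```python
import torch

def fits_on_gpu(T: int, D: int, slack: float = 6.0) -> bool:
    # torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the
    # current device; require `slack` copies of the [T, 2*D] bf16 tensor.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes > slack * T * (2 * D) * 2

# Usage sketch: hypothesis.assume(fits_on_gpu(T, D)) at the top of the test.
```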
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.5893632Z 2025-05-07T20:33:43.5893873Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.5894287Z 2025-05-07T20:33:43.5894497Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.5895283Z self=, 2025-05-07T20:33:43.5896073Z T=16384, 2025-05-07T20:33:43.5896454Z D=7168, 2025-05-07T20:33:43.5896826Z scale_ub=None, 2025-05-07T20:33:43.5897248Z contiguous=True, 2025-05-07T20:33:43.5897681Z compiled=False, 2025-05-07T20:33:43.5898070Z ) 2025-05-07T20:33:43.5898676Z self = 2025-05-07T20:33:43.5899657Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.5900190Z 2025-05-07T20:33:43.5900353Z @given( 2025-05-07T20:33:43.5900782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.5901389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.5901980Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.5902607Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.5903245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.5903794Z ) 2025-05-07T20:33:43.5904452Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.5905311Z def test_silu_mul_quant( 2025-05-07T20:33:43.5905785Z self, 2025-05-07T20:33:43.5906161Z T: int, 2025-05-07T20:33:43.5906536Z D: int, 2025-05-07T20:33:43.5906959Z scale_ub: Optional[float], 2025-05-07T20:33:43.5907491Z contiguous: bool, 2025-05-07T20:33:43.5907955Z compiled: bool, 2025-05-07T20:33:43.5908392Z ) -> None: 2025-05-07T20:33:43.5908804Z torch.manual_seed(2025) 2025-05-07T20:33:43.5909266Z 2025-05-07T20:33:43.5909797Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.5914065Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.5917802Z 2025-05-07T20:33:43.5918055Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.5918468Z 2025-05-07T20:33:43.5918687Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.5919477Z self=, 2025-05-07T20:33:43.5920263Z T=16384, 2025-05-07T20:33:43.5920641Z D=7168, 2025-05-07T20:33:43.5921037Z scale_ub=1200.0, 2025-05-07T20:33:43.5921490Z contiguous=True, 2025-05-07T20:33:43.5921919Z compiled=False, 2025-05-07T20:33:43.5922308Z ) 2025-05-07T20:33:43.5922917Z self = 2025-05-07T20:33:43.5924115Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:43.5924671Z 2025-05-07T20:33:43.5924823Z @given( 2025-05-07T20:33:43.5925269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.5925878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.5926470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.5927261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.5927970Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.5928497Z ) 2025-05-07T20:33:43.5929151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.5930007Z def test_silu_mul_quant( 2025-05-07T20:33:43.5930483Z self, 2025-05-07T20:33:43.5930850Z T: int, 2025-05-07T20:33:43.5931230Z D: int, 2025-05-07T20:33:43.5931653Z scale_ub: Optional[float], 2025-05-07T20:33:43.5932168Z contiguous: bool, 2025-05-07T20:33:43.5932639Z compiled: bool, 2025-05-07T20:33:43.5933071Z ) -> None: 2025-05-07T20:33:43.5933476Z torch.manual_seed(2025) 2025-05-07T20:33:43.5933956Z 2025-05-07T20:33:43.5934474Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.5938501Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.5942190Z 2025-05-07T20:33:43.5942438Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.5942852Z 2025-05-07T20:33:43.5943056Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.5943863Z self=, 2025-05-07T20:33:43.5944648Z T=128, 2025-05-07T20:33:43.5945002Z D=5120, 2025-05-07T20:33:43.5945372Z scale_ub=1200.0, 2025-05-07T20:33:43.5945806Z contiguous=False, 2025-05-07T20:33:43.5946225Z compiled=False, 2025-05-07T20:33:43.5946632Z ) 2025-05-07T20:33:43.7483813Z self = 2025-05-07T20:33:43.7484900Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:43.7485457Z 2025-05-07T20:33:43.7485633Z @given( 2025-05-07T20:33:43.7486105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.7486740Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.7487366Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.7488033Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.7488699Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.7489506Z ) 2025-05-07T20:33:43.7490408Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.7491031Z def test_silu_mul_quant( 2025-05-07T20:33:43.7491325Z self, 2025-05-07T20:33:43.7491560Z T: int, 2025-05-07T20:33:43.7491797Z D: int, 2025-05-07T20:33:43.7492056Z scale_ub: Optional[float], 2025-05-07T20:33:43.7492378Z contiguous: bool, 2025-05-07T20:33:43.7492657Z compiled: bool, 2025-05-07T20:33:43.7492927Z ) -> None: 2025-05-07T20:33:43.7493184Z torch.manual_seed(2025) 2025-05-07T20:33:43.7493461Z 2025-05-07T20:33:43.7493780Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.7494176Z 2025-05-07T20:33:43.7494400Z x_sign = torch.sign(x) 2025-05-07T20:33:43.7494741Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.7495107Z x = x_sign * x_clamp 2025-05-07T20:33:43.7495390Z x0 = x[:, :D] 2025-05-07T20:33:43.7495651Z x1 = x[:, D:] 2025-05-07T20:33:43.7495908Z 2025-05-07T20:33:43.7496123Z if contiguous: 2025-05-07T20:33:43.7496405Z x0 = x0.contiguous() 2025-05-07T20:33:43.7496714Z x1 = x1.contiguous() 2025-05-07T20:33:43.7496995Z 2025-05-07T20:33:43.7497318Z if scale_ub is not None: 2025-05-07T20:33:43.7497717Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:43.7498111Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:43.7498466Z ) 2025-05-07T20:33:43.7498697Z else: 2025-05-07T20:33:43.7498953Z scale_ub_tensor = None 2025-05-07T20:33:43.7499243Z 2025-05-07T20:33:43.7499522Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:43.7499892Z op = silu_mul_quant 2025-05-07T20:33:43.7500184Z if compiled: 2025-05-07T20:33:43.7500538Z op = torch.compile(op) 2025-05-07T20:33:43.7500983Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.7501383Z 2025-05-07T20:33:43.7501681Z > y_fp8, y_scale = fn() 2025-05-07T20:33:43.7501919Z 2025-05-07T20:33:43.7502071Z moe/activation_test.py:117: 2025-05-07T20:33:43.7502457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.7502851Z moe/activation_test.py:115: in fn 2025-05-07T20:33:43.7503188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.7503985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:43.7504770Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:43.7505385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:43.7506169Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:43.7506927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:43.7507540Z kernel = self.compile( 2025-05-07T20:33:43.7508165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:43.7508916Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:43.7509371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.7509642Z 2025-05-07T20:33:43.7509883Z self = 2025-05-07T20:33:43.7511115Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:43.7512686Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06c11ea0>} 2025-05-07T20:33:43.7514420Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:43.7515587Z context = 2025-05-07T20:33:43.7515928Z 2025-05-07T20:33:43.7516119Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:43.7516718Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:43.7517257Z module_map=module_map) 2025-05-07T20:33:43.7517674Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:43.7518085Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:43.7518397Z E ^ 2025-05-07T20:33:43.7518929Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:43.7519452Z 2025-05-07T20:33:43.7519932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:43.7520524Z 2025-05-07T20:33:43.7520651Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.7521225Z self=, 2025-05-07T20:33:43.7521684Z T=2048, 2025-05-07T20:33:43.7521910Z D=7168, 2025-05-07T20:33:43.7522139Z scale_ub=None, 2025-05-07T20:33:43.7522393Z contiguous=False, 2025-05-07T20:33:43.7522660Z compiled=False, 2025-05-07T20:33:43.7522904Z ) 2025-05-07T20:33:43.7523270Z self = 2025-05-07T20:33:43.7524002Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:43.7524317Z 2025-05-07T20:33:43.7524413Z @given( 2025-05-07T20:33:43.7524683Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.7525051Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.7525407Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.7525794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.7526173Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.7526512Z ) 2025-05-07T20:33:43.7526924Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.7527425Z def test_silu_mul_quant( 2025-05-07T20:33:43.7527711Z self, 2025-05-07T20:33:43.7527942Z T: int, 2025-05-07T20:33:43.7528169Z D: int, 2025-05-07T20:33:43.7528425Z scale_ub: Optional[float], 2025-05-07T20:33:43.7528741Z contiguous: bool, 2025-05-07T20:33:43.7529021Z compiled: bool, 2025-05-07T20:33:43.7529280Z ) -> None: 2025-05-07T20:33:43.7529533Z torch.manual_seed(2025) 2025-05-07T20:33:43.7529821Z 2025-05-07T20:33:43.7530142Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.7532531Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
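[note] The allocator hint printed with each OOM suggests expandable segments. With ~21.7 GiB already held by PyTorch across earlier examples, the pool looks exhausted rather than fragmented, but the suggested setting is cheap to try; it must be set before the first CUDA allocation in the process. A minimal sketch:

    # Trying the allocator's own suggestion; set the variable before any CUDA
    # allocation happens (e.g. before the test module runs).
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch
    x = torch.randn(1024, device="cuda")  # allocations now use expandable segments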
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.7534640Z 2025-05-07T20:33:43.7534781Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.7535028Z 2025-05-07T20:33:43.7543037Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.7543550Z self=, 2025-05-07T20:33:43.7544127Z T=128, 2025-05-07T20:33:43.7544342Z D=7168, 2025-05-07T20:33:43.7544634Z scale_ub=1200.0, 2025-05-07T20:33:43.7544906Z contiguous=True, 2025-05-07T20:33:43.7545166Z compiled=True, 2025-05-07T20:33:43.7545413Z ) 2025-05-07T20:33:43.7989533Z self = 2025-05-07T20:33:43.7990958Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:43.7991386Z 2025-05-07T20:33:43.7991530Z @given( 2025-05-07T20:33:43.7991907Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.7992393Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.7992857Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.7993325Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.7993796Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.7994139Z ) 2025-05-07T20:33:43.7994549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.7995071Z def test_silu_mul_quant( 2025-05-07T20:33:43.7995368Z self, 2025-05-07T20:33:43.7995601Z T: int, 2025-05-07T20:33:43.7995850Z D: int, 2025-05-07T20:33:43.7996119Z scale_ub: Optional[float], 2025-05-07T20:33:43.7996445Z contiguous: bool, 2025-05-07T20:33:43.7996863Z compiled: bool, 2025-05-07T20:33:43.7997209Z ) -> None: 2025-05-07T20:33:43.7997479Z torch.manual_seed(2025) 2025-05-07T20:33:43.7997765Z 2025-05-07T20:33:43.7998093Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.7998496Z 2025-05-07T20:33:43.7998727Z x_sign = torch.sign(x) 2025-05-07T20:33:43.7999078Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.7999449Z x = x_sign * x_clamp 2025-05-07T20:33:43.7999735Z x0 = x[:, :D] 2025-05-07T20:33:43.8000001Z x1 = x[:, D:] 2025-05-07T20:33:43.8000254Z 2025-05-07T20:33:43.8000480Z if contiguous: 2025-05-07T20:33:43.8000765Z x0 = x0.contiguous() 2025-05-07T20:33:43.8001082Z x1 = x1.contiguous() 2025-05-07T20:33:43.8001393Z 2025-05-07T20:33:43.8001631Z if scale_ub is not None: 2025-05-07T20:33:43.8001959Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:43.8002355Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:43.8002722Z ) 2025-05-07T20:33:43.8002954Z else: 2025-05-07T20:33:43.8003208Z scale_ub_tensor = None 2025-05-07T20:33:43.8003509Z 2025-05-07T20:33:43.8003785Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:43.8004149Z op = silu_mul_quant 2025-05-07T20:33:43.8004447Z if compiled: 2025-05-07T20:33:43.8004745Z op = torch.compile(op) 2025-05-07T20:33:43.8005094Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.8005418Z 2025-05-07T20:33:43.8005654Z > y_fp8, y_scale = fn() 2025-05-07T20:33:43.8005848Z 2025-05-07T20:33:43.8005970Z moe/activation_test.py:117: 2025-05-07T20:33:43.8006319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.8006711Z moe/activation_test.py:115: in fn 2025-05-07T20:33:43.8007042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.8007695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:43.8008347Z return fn(*args, **kwargs) 
2025-05-07T20:33:43.8009104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:43.8009885Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:43.8010504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:43.8011284Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:43.8012185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:43.8012795Z kernel = self.compile( 2025-05-07T20:33:43.8013423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:43.8014180Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:43.8014635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.8014903Z 2025-05-07T20:33:43.8015141Z self = 2025-05-07T20:33:43.8016365Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:43.8017918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06c137f0>} 2025-05-07T20:33:43.8019440Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:43.8020685Z context = 2025-05-07T20:33:43.8021023Z 2025-05-07T20:33:43.8021218Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:43.8021816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:43.8022358Z module_map=module_map) 2025-05-07T20:33:43.8022775Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:43.8023189Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:43.8023500Z E ^ 2025-05-07T20:33:43.8024224Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:43.8024745Z 2025-05-07T20:33:43.8025220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:43.8025809Z 2025-05-07T20:33:43.8025936Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.8026426Z self=, 2025-05-07T20:33:43.8026884Z T=128, 2025-05-07T20:33:43.8027114Z D=7168, 2025-05-07T20:33:43.8027349Z scale_ub=1200.0, 2025-05-07T20:33:43.8027610Z contiguous=True, 2025-05-07T20:33:43.8027877Z compiled=False, 2025-05-07T20:33:43.8028126Z ) 2025-05-07T20:33:43.8028494Z self = 2025-05-07T20:33:43.8029068Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:43.8029382Z 2025-05-07T20:33:43.8029483Z @given( 2025-05-07T20:33:43.8029756Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.8030126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.8030493Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.8030878Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.8031264Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.8031602Z ) 2025-05-07T20:33:43.8032011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.8032518Z def test_silu_mul_quant( 2025-05-07T20:33:43.8032804Z self, 2025-05-07T20:33:43.8033045Z T: int, 2025-05-07T20:33:43.8033279Z D: int, 2025-05-07T20:33:43.8033611Z scale_ub: Optional[float], 2025-05-07T20:33:43.8033933Z contiguous: bool, 2025-05-07T20:33:43.8034214Z compiled: bool, 2025-05-07T20:33:43.8034484Z ) -> None: 2025-05-07T20:33:43.8034826Z torch.manual_seed(2025) 2025-05-07T20:33:43.8035109Z 2025-05-07T20:33:43.8035502Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.8035903Z 2025-05-07T20:33:43.8036137Z x_sign = torch.sign(x) 2025-05-07T20:33:43.8036475Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.8038748Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
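[note] The sign/clamp preamble in the test body bounds every input magnitude to [0.01, 2.0], presumably to keep the fp8 comparison away from underflow and overflow. A minimal illustration of the transform:

    import torch

    x = torch.randn(4, 8)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    # every nonzero magnitude now lies in [0.01, 2.0]; sign(0) == 0 keeps exact zeros
    assert ((x == 0) | ((x.abs() >= 0.01) & (x.abs() <= 2.0))).all()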
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.8040841Z 2025-05-07T20:33:43.8040983Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:43.8041231Z 2025-05-07T20:33:43.8041388Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.8041923Z self=, 2025-05-07T20:33:43.8042392Z T=128, 2025-05-07T20:33:43.8042620Z D=5120, 2025-05-07T20:33:43.8042927Z scale_ub=1200.0, 2025-05-07T20:33:43.8043192Z contiguous=True, 2025-05-07T20:33:43.8043522Z compiled=True, 2025-05-07T20:33:43.8043770Z ) 2025-05-07T20:33:43.8044140Z self = 2025-05-07T20:33:43.8044912Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:43.8045330Z 2025-05-07T20:33:43.8045472Z @given( 2025-05-07T20:33:43.8045757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.8046126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.8046487Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.8046871Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.8047264Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.8047604Z ) 2025-05-07T20:33:43.8048016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.8048524Z def test_silu_mul_quant( 2025-05-07T20:33:43.8048820Z self, 2025-05-07T20:33:43.8049061Z T: int, 2025-05-07T20:33:43.8049297Z D: int, 2025-05-07T20:33:43.8049581Z scale_ub: Optional[float], 2025-05-07T20:33:43.8049926Z contiguous: bool, 2025-05-07T20:33:43.8050214Z compiled: bool, 2025-05-07T20:33:43.8050475Z ) -> None: 2025-05-07T20:33:43.8050734Z torch.manual_seed(2025) 2025-05-07T20:33:43.8051026Z 2025-05-07T20:33:43.8051343Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.8051745Z 2025-05-07T20:33:43.8051979Z x_sign = torch.sign(x) 2025-05-07T20:33:43.8052319Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.8054586Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
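[note] The compiled=True path wraps the op with torch.compile inside the fn() closure, so one code path covers both eager and compiled execution. A trimmed sketch of the pattern; silu_mul here is a hypothetical stand-in for silu_mul_quant:

    import torch

    def silu_mul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # hypothetical stand-in for the eager op under test
        return a * torch.sigmoid(a) * b

    op = torch.compile(silu_mul)  # drop-in wrapper with the same call signature
    y = op(torch.randn(2, 4), torch.randn(2, 4))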
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.8056899Z 2025-05-07T20:33:43.8057498Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:43.8057755Z 2025-05-07T20:33:43.8057881Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.8058360Z self=, 2025-05-07T20:33:43.8058977Z T=128, 2025-05-07T20:33:43.8059282Z D=7168, 2025-05-07T20:33:43.8059512Z scale_ub=None, 2025-05-07T20:33:43.8059807Z contiguous=True, 2025-05-07T20:33:43.8060076Z compiled=True, 2025-05-07T20:33:43.8060320Z ) 2025-05-07T20:33:44.0196431Z self = 2025-05-07T20:33:44.0197129Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0197454Z 2025-05-07T20:33:44.0197547Z @given( 2025-05-07T20:33:44.0197820Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0198178Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0198537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0198922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0199310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0199638Z ) 2025-05-07T20:33:44.0200045Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0200556Z def test_silu_mul_quant( 2025-05-07T20:33:44.0200834Z self, 2025-05-07T20:33:44.0201067Z T: int, 2025-05-07T20:33:44.0201319Z D: int, 2025-05-07T20:33:44.0201581Z scale_ub: Optional[float], 2025-05-07T20:33:44.0201905Z contiguous: bool, 2025-05-07T20:33:44.0202186Z compiled: bool, 2025-05-07T20:33:44.0202575Z ) -> None: 2025-05-07T20:33:44.0202891Z torch.manual_seed(2025) 2025-05-07T20:33:44.0203176Z 2025-05-07T20:33:44.0203487Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0205797Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
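[note] The contiguous parameter matters because x0 = x[:, :D] and x1 = x[:, D:] are strided views into the [T, 2*D] buffer (row stride 2*D), so neither is contiguous until .contiguous() materializes a copy:

    import torch

    x = torch.randn(2, 8)          # stands in for the [T, 2*D] input
    x0, x1 = x[:, :4], x[:, 4:]
    print(x0.is_contiguous(), x1.is_contiguous())   # False False -- both are views
    print(x0.contiguous().is_contiguous())          # True -- materialized copy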
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.0207884Z 2025-05-07T20:33:44.0208022Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:44.0208272Z 2025-05-07T20:33:44.0219220Z FAILED 2025-05-07T20:33:44.0219375Z 2025-05-07T20:33:44.0219590Z =================================== FAILURES =================================== 2025-05-07T20:33:44.0220295Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:44.0221048Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:44.0222033Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:33:44.0222876Z | yield 2025-05-07T20:33:44.0223549Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:33:44.0224592Z | self._callTestMethod(testMethod) 2025-05-07T20:33:44.0225479Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:33:44.0226361Z | method() 2025-05-07T20:33:44.0227490Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:44.0228649Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0229641Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:44.0230618Z | raise the_error_hypothesis_found 2025-05-07T20:33:44.0231377Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:44.0232182Z +-+---------------- 1 ---------------- 2025-05-07T20:33:44.0232646Z | Traceback (most recent call last): 2025-05-07T20:33:44.0234089Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:44.0235325Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0238531Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
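[note] On Python 3.10, Hypothesis reports the distinct falsifying failures as one exceptiongroup.ExceptionGroup (the backport of the 3.11 built-in), with one sub-exception per failure; that is the "4 sub-exceptions" structure above. A minimal illustration, assuming the exceptiongroup backport is installed:

    from exceptiongroup import ExceptionGroup  # 3.11 backport used on Python 3.10

    try:
        raise ExceptionGroup("demo", [ValueError("a"), TypeError("b")])
    except ExceptionGroup as eg:
        print(len(eg.exceptions))  # 2 sub-exceptions, like the 4 in this log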
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.0241626Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:44.0242319Z | self=, 2025-05-07T20:33:44.0242952Z | T=2048, 2025-05-07T20:33:44.0243340Z | D=5120, # or any other generated value 2025-05-07T20:33:44.0243874Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:44.0244452Z | contiguous=True, # or any other generated value 2025-05-07T20:33:44.0245130Z | compiled=False, # or any other generated value 2025-05-07T20:33:44.0245691Z | ) 2025-05-07T20:33:44.0245993Z | 2025-05-07T20:33:44.0246808Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:44.0247757Z +---------------- 2 ---------------- 2025-05-07T20:33:44.0248226Z | Traceback (most recent call last): 2025-05-07T20:33:44.0249359Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:44.0250567Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0253713Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.0256741Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:44.0257437Z | self=, 2025-05-07T20:33:44.0258074Z | T=128, 2025-05-07T20:33:44.0258396Z | D=7168, 2025-05-07T20:33:44.0258741Z | scale_ub=None, 2025-05-07T20:33:44.0259131Z | contiguous=True, 2025-05-07T20:33:44.0259516Z | compiled=True, 2025-05-07T20:33:44.0259884Z | ) 2025-05-07T20:33:44.0260177Z | 2025-05-07T20:33:44.0260993Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:44.0261938Z +---------------- 3 ---------------- 2025-05-07T20:33:44.0262410Z | Traceback (most recent call last): 2025-05-07T20:33:44.0263440Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:44.0264320Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0266694Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
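[note] Each failure comes with a ready-made replay decorator; applied locally, it forces Hypothesis to rerun exactly that example. A trimmed sketch using the blob from failure 1 above (the strategy list is cut down here for brevity):

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob from the log above
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_repro(T: int) -> None:
        ...  # test body elided; remove the decorator once the bug is fixed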
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.0268946Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:44.0269455Z | self=, 2025-05-07T20:33:44.0269918Z | T=128, 2025-05-07T20:33:44.0270159Z | D=5120, 2025-05-07T20:33:44.0270412Z | scale_ub=1200.0, 2025-05-07T20:33:44.0270743Z | contiguous=True, 2025-05-07T20:33:44.0271142Z | compiled=True, 2025-05-07T20:33:44.0271517Z | ) 2025-05-07T20:33:44.0271820Z | 2025-05-07T20:33:44.0272675Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:44.0273812Z +---------------- 4 ---------------- 2025-05-07T20:33:44.0274308Z | Traceback (most recent call last): 2025-05-07T20:33:44.0275466Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:44.0276768Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:44.0277875Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:44.0278681Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0279628Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:44.0280749Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0281710Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:44.0282836Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0283995Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:44.0285252Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0286545Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:33:44.0287856Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0289136Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:44.0290284Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0291568Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:44.0292685Z | fn() 2025-05-07T20:33:44.0293676Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:44.0294664Z | self.fn.run( 2025-05-07T20:33:44.0295502Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:44.0296425Z | kernel = self.compile( 2025-05-07T20:33:44.0297397Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:44.0298520Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0299697Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:44.0301147Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0302128Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0302680Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0303083Z | ^ 2025-05-07T20:33:44.0303780Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0304641Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:44.0305244Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:44.0306012Z | self=, 2025-05-07T20:33:44.0306687Z | T=1, # or any other generated value 2025-05-07T20:33:44.0307181Z | D=5120, # or any other generated value 2025-05-07T20:33:44.0307725Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:44.0308327Z | contiguous=True, # or any other generated value 2025-05-07T20:33:44.0308921Z | compiled=True, # or any other generated value 2025-05-07T20:33:44.0309413Z | ) 2025-05-07T20:33:44.0309787Z | 2025-05-07T20:33:44.0310669Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:44.0311637Z +------------------------------------ 2025-05-07T20:33:44.0312215Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:44.0312823Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0313491Z self=, 2025-05-07T20:33:44.0314292Z T=1, 2025-05-07T20:33:44.0314597Z D=5120, 2025-05-07T20:33:44.0314924Z scale_ub=None, 2025-05-07T20:33:44.0315281Z contiguous=True, 2025-05-07T20:33:44.0315655Z compiled=True, 2025-05-07T20:33:44.0316003Z ) 2025-05-07T20:33:44.0316524Z self = 2025-05-07T20:33:44.0317306Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0317737Z 2025-05-07T20:33:44.0317868Z @given( 2025-05-07T20:33:44.0318255Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0318758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0319262Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0319807Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0320337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0320812Z ) 2025-05-07T20:33:44.0321386Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0322173Z def test_silu_mul_quant( 2025-05-07T20:33:44.0322559Z self, 2025-05-07T20:33:44.0322884Z T: int, 2025-05-07T20:33:44.0323206Z D: int, 2025-05-07T20:33:44.0348671Z scale_ub: Optional[float], 2025-05-07T20:33:44.0349208Z contiguous: bool, 2025-05-07T20:33:44.0349621Z compiled: bool, 2025-05-07T20:33:44.0350005Z ) -> None: 2025-05-07T20:33:44.0350366Z torch.manual_seed(2025) 2025-05-07T20:33:44.0350769Z 2025-05-07T20:33:44.0351221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0351763Z 2025-05-07T20:33:44.0352090Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0352562Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0353053Z x = x_sign * x_clamp 2025-05-07T20:33:44.0353450Z x0 = x[:, :D] 2025-05-07T20:33:44.0353918Z x1 = x[:, D:] 2025-05-07T20:33:44.0354269Z 2025-05-07T20:33:44.0354588Z if contiguous: 2025-05-07T20:33:44.0354977Z x0 = x0.contiguous() 
2025-05-07T20:33:44.0355406Z x1 = x1.contiguous() 2025-05-07T20:33:44.0356031Z 2025-05-07T20:33:44.0356470Z if scale_ub is not None: 2025-05-07T20:33:44.0356923Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0357473Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0357973Z ) 2025-05-07T20:33:44.0358298Z else: 2025-05-07T20:33:44.0358636Z scale_ub_tensor = None 2025-05-07T20:33:44.0359051Z 2025-05-07T20:33:44.0359405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0359889Z op = silu_mul_quant 2025-05-07T20:33:44.0360279Z if compiled: 2025-05-07T20:33:44.0360674Z op = torch.compile(op) 2025-05-07T20:33:44.0361147Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0361600Z 2025-05-07T20:33:44.0361923Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0362392Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0362876Z 2025-05-07T20:33:44.0363277Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0363826Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0364301Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0364822Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0365534Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0366147Z 2025-05-07T20:33:44.0366495Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:44.0366817Z 2025-05-07T20:33:44.0366998Z moe/activation_test.py:126: 2025-05-07T20:33:44.0367487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0368036Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0368587Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0369841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0371042Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0371928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0373014Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0374108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0375238Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0376466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0377691Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0378892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0379933Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0380898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0381735Z fn() 2025-05-07T20:33:44.0382549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0383499Z self.fn.run( 2025-05-07T20:33:44.0384254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0385128Z kernel = self.compile( 2025-05-07T20:33:44.0386015Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0387082Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0387731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0388188Z 2025-05-07T20:33:44.0388556Z self = 2025-05-07T20:33:44.0390215Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0392406Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae2b098af0>} 2025-05-07T20:33:44.0394669Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0396307Z context = 2025-05-07T20:33:44.0396777Z 2025-05-07T20:33:44.0397042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0397899Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0398649Z module_map=module_map) 2025-05-07T20:33:44.0399223Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0399862Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0400339Z E ^ 2025-05-07T20:33:44.0401102Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0401899Z 2025-05-07T20:33:44.0402573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0403407Z 2025-05-07T20:33:44.0403590Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0404268Z self=, 2025-05-07T20:33:44.0404928Z T=2048, 2025-05-07T20:33:44.0405252Z D=5120, 2025-05-07T20:33:44.0405574Z scale_ub=1200.0, 2025-05-07T20:33:44.0405957Z contiguous=True, 2025-05-07T20:33:44.0406334Z compiled=False, 2025-05-07T20:33:44.0406688Z ) 2025-05-07T20:33:44.0407209Z self = 2025-05-07T20:33:44.0408030Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:44.0408481Z 2025-05-07T20:33:44.0408620Z @given( 2025-05-07T20:33:44.0408993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0409516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0410014Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0410533Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0411067Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0411517Z ) 2025-05-07T20:33:44.0412073Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0412777Z def test_silu_mul_quant( 2025-05-07T20:33:44.0413166Z self, 2025-05-07T20:33:44.0413479Z T: int, 2025-05-07T20:33:44.0413788Z D: int, 2025-05-07T20:33:44.0414133Z scale_ub: Optional[float], 2025-05-07T20:33:44.0414566Z contiguous: bool, 2025-05-07T20:33:44.0414950Z compiled: bool, 2025-05-07T20:33:44.0415314Z ) -> None: 2025-05-07T20:33:44.0415679Z torch.manual_seed(2025) 2025-05-07T20:33:44.0416079Z 2025-05-07T20:33:44.0416532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0417097Z 2025-05-07T20:33:44.0417417Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0417905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0418422Z x = x_sign * x_clamp 2025-05-07T20:33:44.0418816Z x0 = x[:, :D] 
2025-05-07T20:33:44.0419185Z x1 = x[:, D:] 2025-05-07T20:33:44.0419536Z 2025-05-07T20:33:44.0419841Z if contiguous: 2025-05-07T20:33:44.0420296Z x0 = x0.contiguous() 2025-05-07T20:33:44.0420793Z x1 = x1.contiguous() 2025-05-07T20:33:44.0421222Z 2025-05-07T20:33:44.0421546Z if scale_ub is not None: 2025-05-07T20:33:44.0422008Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0422559Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0423054Z ) 2025-05-07T20:33:44.0423373Z else: 2025-05-07T20:33:44.0423715Z scale_ub_tensor = None 2025-05-07T20:33:44.0424439Z 2025-05-07T20:33:44.0424812Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0425319Z op = silu_mul_quant 2025-05-07T20:33:44.0425720Z if compiled: 2025-05-07T20:33:44.0426122Z op = torch.compile(op) 2025-05-07T20:33:44.0426608Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0427054Z 2025-05-07T20:33:44.0427376Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.0427648Z 2025-05-07T20:33:44.0427815Z moe/activation_test.py:117: 2025-05-07T20:33:44.0428287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0428812Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.0429260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0430542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.0431650Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.0432443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0433482Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0434653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0435511Z kernel = self.compile( 2025-05-07T20:33:44.0436400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0437445Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0438078Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0438467Z 2025-05-07T20:33:44.0438795Z self = 2025-05-07T20:33:44.0440498Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0442722Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae2af71990>} 2025-05-07T20:33:44.0444877Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0446452Z context = 2025-05-07T20:33:44.0446881Z 2025-05-07T20:33:44.0447129Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0447952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0448714Z module_map=module_map) 2025-05-07T20:33:44.0449285Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0449849Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.0450272Z E ^ 2025-05-07T20:33:44.0451009Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0451731Z 2025-05-07T20:33:44.0452400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0453414Z 2025-05-07T20:33:44.0453590Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0454261Z self=, 2025-05-07T20:33:44.0454904Z T=2048, 2025-05-07T20:33:44.0455223Z D=5120, 2025-05-07T20:33:44.0455491Z scale_ub=1200.0, 2025-05-07T20:33:44.0455763Z contiguous=True, 2025-05-07T20:33:44.0456034Z compiled=True, 2025-05-07T20:33:44.0456284Z ) 2025-05-07T20:33:44.0456657Z self = 2025-05-07T20:33:44.0457233Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:44.0457547Z 2025-05-07T20:33:44.0457649Z @given( 2025-05-07T20:33:44.0457926Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0458289Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0458656Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0459049Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0459432Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0459772Z ) 2025-05-07T20:33:44.0460186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0460824Z def test_silu_mul_quant( 2025-05-07T20:33:44.0461191Z self, 2025-05-07T20:33:44.0461435Z T: int, 2025-05-07T20:33:44.0461670Z D: int, 2025-05-07T20:33:44.0461940Z scale_ub: Optional[float], 2025-05-07T20:33:44.0462268Z contiguous: bool, 2025-05-07T20:33:44.0462556Z compiled: bool, 2025-05-07T20:33:44.0462832Z ) -> None: 2025-05-07T20:33:44.0463092Z torch.manual_seed(2025) 2025-05-07T20:33:44.0463383Z 2025-05-07T20:33:44.0463698Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0464105Z 2025-05-07T20:33:44.0464348Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0464693Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0465078Z x = x_sign * x_clamp 2025-05-07T20:33:44.0465371Z x0 = x[:, :D] 2025-05-07T20:33:44.0465628Z x1 = x[:, D:] 2025-05-07T20:33:44.0465882Z 2025-05-07T20:33:44.0466120Z if contiguous: 2025-05-07T20:33:44.0466394Z x0 = x0.contiguous() 2025-05-07T20:33:44.0466710Z x1 = x1.contiguous() 2025-05-07T20:33:44.0467004Z 2025-05-07T20:33:44.0467237Z if scale_ub is not None: 2025-05-07T20:33:44.0467566Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0467965Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0468328Z ) 2025-05-07T20:33:44.0468568Z else: 2025-05-07T20:33:44.0468827Z scale_ub_tensor = None 2025-05-07T20:33:44.0469127Z 2025-05-07T20:33:44.0469410Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0469797Z op = silu_mul_quant 2025-05-07T20:33:44.0470097Z if compiled: 2025-05-07T20:33:44.0470399Z op = torch.compile(op) 2025-05-07T20:33:44.0470782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0471145Z 2025-05-07T20:33:44.0471380Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0471729Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0472078Z 2025-05-07T20:33:44.0472358Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0472756Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0473107Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0473475Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0473977Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0474349Z 2025-05-07T20:33:44.0474595Z > 
y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fae29a196c0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <triton._C.libtriton.ir.context object at 0x...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fae29a18940>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <triton._C.libtriton.ir.context object at 0x...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
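All of these failures are the same event: Triton refuses to lower the fp8e4nv (float8 E4M3) type on this GPU, so both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row die in make_ir before any kernel runs. The linux.g5.4xlarge.nvidia.gpu runner carries an A10G, which reports compute capability (8, 6); Triton's fp8e4nv lowering appears to require sm_89 (Ada) or newer, which is why the only fp8 dtypes on offer here are 'fp8e4b15' and 'fp8e5'. Below is a minimal sketch of a guard a test like this could use; the (8, 9) cutoff is our inference from the ValueError, not something the log states, and the names are hypothetical:

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (E4M3) lowering needs sm_89 or newer; the A10G on
    # this runner reports (8, 6) and trips the ValueError seen above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the test method from the log, turning the hard
# CompilationError into a skip on pre-Ada GPUs:
#
#     @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires sm_89+")
#     def test_silu_mul_quant(self, ...) -> None: ...

if __name__ == "__main__":
    print("fp8e4nv supported:", _supports_fp8e4nv())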
y_scale_ref = ref_fn() 2025-05-07T20:33:44.0566954Z 2025-05-07T20:33:44.0567080Z moe/activation_test.py:126: 2025-05-07T20:33:44.0567423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0567816Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0568199Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0569092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0569947Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0570579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0571363Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0572143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0572974Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0573838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0574695Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0575521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0576259Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0577054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0577648Z fn() 2025-05-07T20:33:44.0578235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0578910Z self.fn.run( 2025-05-07T20:33:44.0579455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0580061Z kernel = self.compile( 2025-05-07T20:33:44.0580685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0581438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0581897Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0582166Z 2025-05-07T20:33:44.0582407Z self = 2025-05-07T20:33:44.0583639Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0585235Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fae297b0790>} 2025-05-07T20:33:44.0586788Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0587948Z context = 2025-05-07T20:33:44.0588278Z 2025-05-07T20:33:44.0588470Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0589068Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0589615Z module_map=module_map) 2025-05-07T20:33:44.0590030Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0590450Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0590767Z E ^ 2025-05-07T20:33:44.0591306Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0591876Z 2025-05-07T20:33:44.0592376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0592968Z 2025-05-07T20:33:44.0593093Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0593635Z self=, 2025-05-07T20:33:44.0594100Z T=4096, 2025-05-07T20:33:44.0594322Z D=5120, 2025-05-07T20:33:44.0594554Z scale_ub=None, 2025-05-07T20:33:44.0594815Z contiguous=False, 2025-05-07T20:33:44.0595080Z compiled=False, 2025-05-07T20:33:44.0595328Z ) 2025-05-07T20:33:44.0595705Z self = 2025-05-07T20:33:44.0596273Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:44.0596601Z 2025-05-07T20:33:44.0596696Z @given( 2025-05-07T20:33:44.0596971Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0597334Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0597693Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0598081Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0598470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0598801Z ) 2025-05-07T20:33:44.0599211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0599729Z def test_silu_mul_quant( 2025-05-07T20:33:44.0600071Z self, 2025-05-07T20:33:44.0600304Z T: int, 2025-05-07T20:33:44.0600539Z D: int, 2025-05-07T20:33:44.0600840Z scale_ub: Optional[float], 2025-05-07T20:33:44.0601163Z contiguous: bool, 2025-05-07T20:33:44.0601450Z compiled: bool, 2025-05-07T20:33:44.0601712Z ) -> None: 2025-05-07T20:33:44.0601975Z torch.manual_seed(2025) 2025-05-07T20:33:44.0602261Z 2025-05-07T20:33:44.0602579Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0602976Z 2025-05-07T20:33:44.0603207Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0603543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0603908Z x = x_sign * x_clamp 2025-05-07T20:33:44.0604197Z x0 = x[:, :D] 2025-05-07T20:33:44.0604458Z x1 = x[:, D:] 2025-05-07T20:33:44.0604700Z 2025-05-07T20:33:44.0604927Z if contiguous: 2025-05-07T20:33:44.0605206Z x0 = x0.contiguous() 2025-05-07T20:33:44.0605506Z x1 = x1.contiguous() 2025-05-07T20:33:44.0605791Z 2025-05-07T20:33:44.0606031Z if scale_ub is not None: 2025-05-07T20:33:44.0606353Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0606749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0607166Z ) 2025-05-07T20:33:44.0607395Z else: 2025-05-07T20:33:44.0607691Z scale_ub_tensor = None 2025-05-07T20:33:44.0607995Z 2025-05-07T20:33:44.0608264Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0608630Z op = silu_mul_quant 2025-05-07T20:33:44.0608925Z if compiled: 
2025-05-07T20:33:44.0609212Z op = torch.compile(op) 2025-05-07T20:33:44.0609561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0609887Z 2025-05-07T20:33:44.0610113Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.0610313Z 2025-05-07T20:33:44.0610431Z moe/activation_test.py:117: 2025-05-07T20:33:44.0610782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0611178Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.0611506Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0612301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.0613102Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.0613715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0614498Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0615259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0615873Z kernel = self.compile( 2025-05-07T20:33:44.0616493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0617251Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0617712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0617975Z 2025-05-07T20:33:44.0618218Z self = 2025-05-07T20:33:44.0619444Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0620998Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae297b1510>} 2025-05-07T20:33:44.0622524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0624015Z context = 2025-05-07T20:33:44.0624432Z 2025-05-07T20:33:44.0624636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0625235Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0625790Z module_map=module_map) 2025-05-07T20:33:44.0626216Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0626626Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.0626933Z E ^ 2025-05-07T20:33:44.0627475Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0627993Z 2025-05-07T20:33:44.0628474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0629061Z 2025-05-07T20:33:44.0629186Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0629671Z self=, 2025-05-07T20:33:44.0630137Z T=4096, 2025-05-07T20:33:44.0630357Z D=7168, 2025-05-07T20:33:44.0630588Z scale_ub=None, 2025-05-07T20:33:44.0630952Z contiguous=False, 2025-05-07T20:33:44.0631276Z compiled=False, 2025-05-07T20:33:44.0631522Z ) 2025-05-07T20:33:44.0631896Z self = 2025-05-07T20:33:44.0632463Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:44.0632782Z 2025-05-07T20:33:44.0632874Z @given( 2025-05-07T20:33:44.0633142Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0633566Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0633930Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0634316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0634712Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0635047Z ) 2025-05-07T20:33:44.0635454Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0635967Z def test_silu_mul_quant( 2025-05-07T20:33:44.0636253Z self, 2025-05-07T20:33:44.0636484Z T: int, 2025-05-07T20:33:44.0636724Z D: int, 2025-05-07T20:33:44.0636975Z scale_ub: Optional[float], 2025-05-07T20:33:44.0637296Z contiguous: bool, 2025-05-07T20:33:44.0637582Z compiled: bool, 2025-05-07T20:33:44.0637844Z ) -> None: 2025-05-07T20:33:44.0638093Z torch.manual_seed(2025) 2025-05-07T20:33:44.0638375Z 2025-05-07T20:33:44.0638691Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0639087Z 2025-05-07T20:33:44.0639320Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0639661Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0640021Z x = x_sign * x_clamp 2025-05-07T20:33:44.0640311Z x0 = x[:, :D] 2025-05-07T20:33:44.0640566Z x1 = x[:, D:] 2025-05-07T20:33:44.0640807Z 2025-05-07T20:33:44.0641029Z if contiguous: 2025-05-07T20:33:44.0641302Z x0 = x0.contiguous() 2025-05-07T20:33:44.0641605Z x1 = x1.contiguous() 2025-05-07T20:33:44.0641892Z 2025-05-07T20:33:44.0642127Z if scale_ub is not None: 2025-05-07T20:33:44.0642447Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0642841Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0643203Z ) 2025-05-07T20:33:44.0643430Z else: 2025-05-07T20:33:44.0643672Z scale_ub_tensor = None 2025-05-07T20:33:44.0643964Z 2025-05-07T20:33:44.0644239Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0644599Z op = silu_mul_quant 2025-05-07T20:33:44.0644893Z if compiled: 2025-05-07T20:33:44.0645305Z op = torch.compile(op) 2025-05-07T20:33:44.0645713Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0646037Z 2025-05-07T20:33:44.0646271Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.0646464Z 2025-05-07T20:33:44.0646580Z moe/activation_test.py:117: 2025-05-07T20:33:44.0646931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0647322Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.0647649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0648444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.0649231Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.0649848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0650625Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0651394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0652005Z kernel = self.compile( 2025-05-07T20:33:44.0652630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0653505Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0653964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0654229Z 2025-05-07T20:33:44.0654472Z self = 2025-05-07T20:33:44.0655691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0657259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae297b1bd0>} 2025-05-07T20:33:44.0658787Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0659962Z context = 2025-05-07T20:33:44.0660293Z 2025-05-07T20:33:44.0660491Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0661083Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0661626Z module_map=module_map) 2025-05-07T20:33:44.0662052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0662454Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.0662762Z E ^ 2025-05-07T20:33:44.0663303Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0663816Z 2025-05-07T20:33:44.0664294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0664880Z 2025-05-07T20:33:44.0665007Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0665489Z self=, 2025-05-07T20:33:44.0665950Z T=128, 2025-05-07T20:33:44.0666169Z D=7168, 2025-05-07T20:33:44.0666399Z scale_ub=None, 2025-05-07T20:33:44.0666653Z contiguous=False, 2025-05-07T20:33:44.0666917Z compiled=True, 2025-05-07T20:33:44.0667151Z ) 2025-05-07T20:33:44.0667520Z self = 2025-05-07T20:33:44.0668085Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:44.0668451Z 2025-05-07T20:33:44.0668545Z @given( 2025-05-07T20:33:44.0668862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0669229Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0669583Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0669972Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0670363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0670704Z ) 2025-05-07T20:33:44.0671103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0671613Z def test_silu_mul_quant( 2025-05-07T20:33:44.0671901Z self, 2025-05-07T20:33:44.0672131Z T: int, 2025-05-07T20:33:44.0672367Z D: int, 2025-05-07T20:33:44.0672625Z scale_ub: Optional[float], 2025-05-07T20:33:44.0672946Z contiguous: bool, 2025-05-07T20:33:44.0673225Z compiled: bool, 2025-05-07T20:33:44.0673488Z ) -> None: 2025-05-07T20:33:44.0673794Z torch.manual_seed(2025) 2025-05-07T20:33:44.0674075Z 2025-05-07T20:33:44.0674400Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0674790Z 2025-05-07T20:33:44.0675023Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0675361Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0675780Z x = x_sign * x_clamp 2025-05-07T20:33:44.0676109Z x0 = x[:, :D] 2025-05-07T20:33:44.0676366Z x1 = x[:, D:] 2025-05-07T20:33:44.0676607Z 2025-05-07T20:33:44.0676829Z if contiguous: 2025-05-07T20:33:44.0677101Z x0 = x0.contiguous() 2025-05-07T20:33:44.0677402Z x1 = x1.contiguous() 2025-05-07T20:33:44.0677680Z 2025-05-07T20:33:44.0677907Z if scale_ub is not None: 2025-05-07T20:33:44.0678231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0678615Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0678978Z ) 2025-05-07T20:33:44.0679207Z else: 2025-05-07T20:33:44.0679451Z scale_ub_tensor = None 2025-05-07T20:33:44.0679745Z 2025-05-07T20:33:44.0680015Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0680375Z op = silu_mul_quant 2025-05-07T20:33:44.0680668Z if compiled: 2025-05-07T20:33:44.0680963Z op = torch.compile(op) 2025-05-07T20:33:44.0681304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0681627Z 2025-05-07T20:33:44.0681858Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0682187Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0682530Z 2025-05-07T20:33:44.0682810Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0683197Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0683538Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0683906Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0684330Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0684693Z 2025-05-07T20:33:44.0684939Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:44.0685167Z 2025-05-07T20:33:44.0685292Z moe/activation_test.py:126: 2025-05-07T20:33:44.0685635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0686037Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0686419Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0687314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0687435Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0687846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0688109Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0688626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0688928Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0689385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0689681Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0690114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0690308Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0690698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0690797Z fn() 2025-05-07T20:33:44.0691307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0691413Z self.fn.run( 2025-05-07T20:33:44.0691797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0691907Z kernel = self.compile( 2025-05-07T20:33:44.0692433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0692637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0692788Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0692799Z 2025-05-07T20:33:44.0693037Z self = 2025-05-07T20:33:44.0693921Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0694511Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fae29a1a9e0>} 2025-05-07T20:33:44.0695356Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0695587Z context = 2025-05-07T20:33:44.0695592Z 2025-05-07T20:33:44.0695784Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0696090Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0696220Z module_map=module_map) 2025-05-07T20:33:44.0696408Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0696540Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0696636Z E ^ 2025-05-07T20:33:44.0697042Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0697048Z 2025-05-07T20:33:44.0697529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0697535Z 2025-05-07T20:33:44.0697657Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0697910Z self=, 2025-05-07T20:33:44.0698007Z T=128, 2025-05-07T20:33:44.0698096Z D=7168, 2025-05-07T20:33:44.0698197Z scale_ub=None, 2025-05-07T20:33:44.0698303Z contiguous=False, 2025-05-07T20:33:44.0698402Z compiled=False, 2025-05-07T20:33:44.0698494Z ) 2025-05-07T20:33:44.0698745Z self = 2025-05-07T20:33:44.0698993Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:44.0699042Z 2025-05-07T20:33:44.0699141Z @given( 2025-05-07T20:33:44.0699280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0699398Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0699542Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0699684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0699825Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0699913Z ) 2025-05-07T20:33:44.0700197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0700312Z def test_silu_mul_quant( 2025-05-07T20:33:44.0700404Z self, 2025-05-07T20:33:44.0700495Z T: int, 2025-05-07T20:33:44.0700592Z D: int, 2025-05-07T20:33:44.0700708Z scale_ub: Optional[float], 2025-05-07T20:33:44.0700813Z contiguous: bool, 2025-05-07T20:33:44.0700930Z compiled: bool, 2025-05-07T20:33:44.0701040Z ) -> None: 2025-05-07T20:33:44.0701173Z torch.manual_seed(2025) 2025-05-07T20:33:44.0701269Z 2025-05-07T20:33:44.0701464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0701605Z 2025-05-07T20:33:44.0701713Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0701925Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0702036Z x = x_sign * x_clamp 2025-05-07T20:33:44.0702132Z x0 = x[:, :D] 2025-05-07T20:33:44.0702226Z x1 = x[:, D:] 2025-05-07T20:33:44.0702315Z 2025-05-07T20:33:44.0702415Z if contiguous: 2025-05-07T20:33:44.0702522Z x0 = x0.contiguous() 2025-05-07T20:33:44.0702630Z x1 = x1.contiguous() 2025-05-07T20:33:44.0702714Z 2025-05-07T20:33:44.0702823Z if scale_ub is not None: 2025-05-07T20:33:44.0702954Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0703113Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0703216Z ) 2025-05-07T20:33:44.0703306Z else: 2025-05-07T20:33:44.0703416Z scale_ub_tensor = None 2025-05-07T20:33:44.0703511Z 2025-05-07T20:33:44.0703660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0703771Z op = silu_mul_quant 2025-05-07T20:33:44.0703878Z if compiled: 
2025-05-07T20:33:44.0703996Z op = torch.compile(op) 2025-05-07T20:33:44.0704122Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0704215Z 2025-05-07T20:33:44.0704322Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.0704327Z 2025-05-07T20:33:44.0704443Z moe/activation_test.py:117: 2025-05-07T20:33:44.0704599Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0704720Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.0704844Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0705420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.0705536Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.0705952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0706215Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0706605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0706721Z kernel = self.compile( 2025-05-07T20:33:44.0707160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0707368Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0707517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0707573Z 2025-05-07T20:33:44.0707853Z self = 2025-05-07T20:33:44.0708741Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0709320Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae2982e560>} 2025-05-07T20:33:44.0710170Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0710392Z context = 2025-05-07T20:33:44.0710397Z 2025-05-07T20:33:44.0710597Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0710903Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0711030Z module_map=module_map) 2025-05-07T20:33:44.0711223Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0711427Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.0711522Z E ^ 2025-05-07T20:33:44.0711934Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0711939Z 2025-05-07T20:33:44.0712409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0712415Z 2025-05-07T20:33:44.0712541Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0713411Z self=, 2025-05-07T20:33:44.0713506Z T=4096, 2025-05-07T20:33:44.0713653Z D=5120, 2025-05-07T20:33:44.0713755Z scale_ub=1200.0, 2025-05-07T20:33:44.0713856Z contiguous=True, 2025-05-07T20:33:44.0713959Z compiled=False, 2025-05-07T20:33:44.0714045Z ) 2025-05-07T20:33:44.0714301Z self = 2025-05-07T20:33:44.0714510Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:44.0714515Z 2025-05-07T20:33:44.0714607Z @given( 2025-05-07T20:33:44.0714750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0714867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0715002Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0715143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0715276Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0715364Z ) 2025-05-07T20:33:44.0715653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0715770Z def test_silu_mul_quant( 2025-05-07T20:33:44.0715867Z self, 2025-05-07T20:33:44.0715956Z T: int, 2025-05-07T20:33:44.0716047Z D: int, 2025-05-07T20:33:44.0716167Z scale_ub: Optional[float], 2025-05-07T20:33:44.0716275Z contiguous: bool, 2025-05-07T20:33:44.0716379Z compiled: bool, 2025-05-07T20:33:44.0716479Z ) -> None: 2025-05-07T20:33:44.0716591Z torch.manual_seed(2025) 2025-05-07T20:33:44.0716678Z 2025-05-07T20:33:44.0716879Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0716967Z 2025-05-07T20:33:44.0717075Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0717225Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0717331Z x = x_sign * x_clamp 2025-05-07T20:33:44.0717433Z x0 = x[:, :D] 2025-05-07T20:33:44.0717528Z x1 = x[:, D:] 2025-05-07T20:33:44.0717614Z 2025-05-07T20:33:44.0717784Z if contiguous: 2025-05-07T20:33:44.0717934Z x0 = x0.contiguous() 2025-05-07T20:33:44.0718042Z x1 = x1.contiguous() 2025-05-07T20:33:44.0718138Z 2025-05-07T20:33:44.0718246Z if scale_ub is not None: 2025-05-07T20:33:44.0718370Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0718540Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0718630Z ) 2025-05-07T20:33:44.0718724Z else: 2025-05-07T20:33:44.0718839Z scale_ub_tensor = None 2025-05-07T20:33:44.0718925Z 2025-05-07T20:33:44.0719079Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0719186Z op = silu_mul_quant 2025-05-07T20:33:44.0719288Z if compiled: 2025-05-07T20:33:44.0719410Z op = torch.compile(op) 2025-05-07T20:33:44.0719535Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0719622Z 2025-05-07T20:33:44.0719738Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.0719743Z 2025-05-07T20:33:44.0719858Z moe/activation_test.py:117: 2025-05-07T20:33:44.0720008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0720133Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.0720251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0720916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.0721040Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.0734988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0735280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0735685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0735804Z kernel = self.compile( 2025-05-07T20:33:44.0736261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0736467Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0736617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0736627Z 2025-05-07T20:33:44.0736877Z self = 2025-05-07T20:33:44.0737759Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0738349Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae2982e7a0>} 2025-05-07T20:33:44.0739191Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0739419Z context = 2025-05-07T20:33:44.0739424Z 2025-05-07T20:33:44.0739622Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0739925Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0740057Z module_map=module_map) 2025-05-07T20:33:44.0740243Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0740361Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.0740457Z E ^ 2025-05-07T20:33:44.0740870Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0740876Z 2025-05-07T20:33:44.0741538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0741544Z 2025-05-07T20:33:44.0741671Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0741926Z self=, 2025-05-07T20:33:44.0742028Z T=1, 2025-05-07T20:33:44.0742121Z D=5120, 2025-05-07T20:33:44.0742225Z scale_ub=None, 2025-05-07T20:33:44.0742334Z contiguous=True, 2025-05-07T20:33:44.0742433Z compiled=True, 2025-05-07T20:33:44.0742522Z ) 2025-05-07T20:33:44.0742779Z self = 2025-05-07T20:33:44.0742965Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0742970Z 2025-05-07T20:33:44.0743070Z @given( 2025-05-07T20:33:44.0743211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0743329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0743474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0743613Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0743746Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0743843Z ) 2025-05-07T20:33:44.0744126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0744371Z def test_silu_mul_quant( 2025-05-07T20:33:44.0744470Z self, 2025-05-07T20:33:44.0744562Z T: int, 2025-05-07T20:33:44.0744658Z D: int, 2025-05-07T20:33:44.0744775Z scale_ub: Optional[float], 2025-05-07T20:33:44.0744882Z contiguous: bool, 2025-05-07T20:33:44.0744989Z compiled: bool, 2025-05-07T20:33:44.0745086Z ) -> None: 2025-05-07T20:33:44.0745198Z torch.manual_seed(2025) 2025-05-07T20:33:44.0745292Z 2025-05-07T20:33:44.0745490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0745579Z 2025-05-07T20:33:44.0745697Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0745846Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0745952Z x = x_sign * x_clamp 2025-05-07T20:33:44.0746055Z x0 = x[:, :D] 2025-05-07T20:33:44.0746153Z x1 = x[:, D:] 2025-05-07T20:33:44.0746240Z 2025-05-07T20:33:44.0746348Z if contiguous: 2025-05-07T20:33:44.0746459Z x0 = x0.contiguous() 2025-05-07T20:33:44.0746570Z x1 = x1.contiguous() 2025-05-07T20:33:44.0746658Z 2025-05-07T20:33:44.0746767Z if scale_ub is not None: 2025-05-07T20:33:44.0746895Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0747053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0747142Z ) 2025-05-07T20:33:44.0747238Z else: 2025-05-07T20:33:44.0747348Z scale_ub_tensor = None 2025-05-07T20:33:44.0747435Z 2025-05-07T20:33:44.0747595Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0747705Z op = silu_mul_quant 2025-05-07T20:33:44.0747809Z if compiled: 2025-05-07T20:33:44.0747931Z op = torch.compile(op) 2025-05-07T20:33:44.0748055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0748146Z 2025-05-07T20:33:44.0748256Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0748401Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0748493Z 2025-05-07T20:33:44.0748652Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0748771Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0748897Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0749039Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0749202Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0749292Z 2025-05-07T20:33:44.0749411Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:44.0749469Z 2025-05-07T20:33:44.0749593Z moe/activation_test.py:126: 2025-05-07T20:33:44.0749787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0749915Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0750080Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0750719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0750838Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0751254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0751512Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0751930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0752226Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0752685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0752978Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0753495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0753782Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0754177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0754269Z fn() 2025-05-07T20:33:44.0754730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0754834Z self.fn.run( 2025-05-07T20:33:44.0755219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0755337Z kernel = self.compile( 2025-05-07T20:33:44.0755776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0755979Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0756137Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0756143Z 2025-05-07T20:33:44.0756378Z self = 2025-05-07T20:33:44.0757260Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0757841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fae2982e050>} 2025-05-07T20:33:44.0758684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0758914Z context = 2025-05-07T20:33:44.0758922Z 2025-05-07T20:33:44.0759114Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0759421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0759547Z module_map=module_map) 2025-05-07T20:33:44.0759736Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0759861Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0759952Z E ^ 2025-05-07T20:33:44.0760355Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0760414Z 2025-05-07T20:33:44.0760931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0760937Z 2025-05-07T20:33:44.0761062Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0761326Z self=, 2025-05-07T20:33:44.0761418Z T=2048, 2025-05-07T20:33:44.0761510Z D=5120, 2025-05-07T20:33:44.0761611Z scale_ub=None, 2025-05-07T20:33:44.0761713Z contiguous=True, 2025-05-07T20:33:44.0761813Z compiled=True, 2025-05-07T20:33:44.0761905Z ) 2025-05-07T20:33:44.0762154Z self = 2025-05-07T20:33:44.0762352Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0762357Z 2025-05-07T20:33:44.0762451Z @given( 2025-05-07T20:33:44.0762590Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0762719Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0762856Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0762993Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0763130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0763271Z ) 2025-05-07T20:33:44.0763597Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0763716Z def test_silu_mul_quant( 2025-05-07T20:33:44.0763807Z self, 2025-05-07T20:33:44.0763898Z T: int, 2025-05-07T20:33:44.0763992Z D: int, 2025-05-07T20:33:44.0764108Z scale_ub: Optional[float], 2025-05-07T20:33:44.0764213Z contiguous: bool, 2025-05-07T20:33:44.0764318Z compiled: bool, 2025-05-07T20:33:44.0764410Z ) -> None: 2025-05-07T20:33:44.0764526Z torch.manual_seed(2025) 2025-05-07T20:33:44.0764612Z 2025-05-07T20:33:44.0764811Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0764907Z 2025-05-07T20:33:44.0765016Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0765161Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0765271Z x = x_sign * x_clamp 2025-05-07T20:33:44.0765371Z x0 = x[:, :D] 2025-05-07T20:33:44.0765466Z x1 = x[:, D:] 2025-05-07T20:33:44.0765560Z 2025-05-07T20:33:44.0765660Z if contiguous: 2025-05-07T20:33:44.0765769Z x0 = x0.contiguous() 2025-05-07T20:33:44.0765879Z x1 = x1.contiguous() 2025-05-07T20:33:44.0765965Z 2025-05-07T20:33:44.0766078Z if scale_ub is not None: 2025-05-07T20:33:44.0766202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0766360Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0766454Z ) 2025-05-07T20:33:44.0766543Z else: 2025-05-07T20:33:44.0766653Z scale_ub_tensor = None 2025-05-07T20:33:44.0766747Z 2025-05-07T20:33:44.0766900Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0767008Z op = silu_mul_quant 2025-05-07T20:33:44.0767113Z if compiled: 
2025-05-07T20:33:44.0767232Z op = torch.compile(op) 2025-05-07T20:33:44.0767360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0767451Z 2025-05-07T20:33:44.0767560Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0767705Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0767793Z 2025-05-07T20:33:44.0767951Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0768074Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0768192Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0768333Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0768501Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0768678Z 2025-05-07T20:33:44.0768795Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:44.0768841Z 2025-05-07T20:33:44.0768964Z moe/activation_test.py:126: 2025-05-07T20:33:44.0769113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0769238Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0769402Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0770037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0770161Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0770571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0770826Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0771247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0771547Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0772006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0772382Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0772809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0773006Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0773397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0773494Z fn() 2025-05-07T20:33:44.0773954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0774055Z self.fn.run( 2025-05-07T20:33:44.0774447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0774557Z kernel = self.compile( 2025-05-07T20:33:44.0774990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0775204Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0775353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0775358Z 2025-05-07T20:33:44.0775601Z self = 2025-05-07T20:33:44.0776482Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:33:44.0777062Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae292e77f0>} 2025-05-07T20:33:44.0777906Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0778132Z context = 2025-05-07T20:33:44.0778138Z 2025-05-07T20:33:44.0778333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0778634Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0778764Z module_map=module_map) 2025-05-07T20:33:44.0778953Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0779072Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0779217Z E ^ 2025-05-07T20:33:44.0779666Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0779672Z 2025-05-07T20:33:44.0780143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0780150Z 2025-05-07T20:33:44.0780279Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0780535Z self=, 2025-05-07T20:33:44.0780630Z T=128, 2025-05-07T20:33:44.0780723Z D=5120, 2025-05-07T20:33:44.0780820Z scale_ub=None, 2025-05-07T20:33:44.0780927Z contiguous=True, 2025-05-07T20:33:44.0781026Z compiled=True, 2025-05-07T20:33:44.0781133Z ) 2025-05-07T20:33:44.0781418Z self = 2025-05-07T20:33:44.0781612Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0781620Z 2025-05-07T20:33:44.0781711Z @given( 2025-05-07T20:33:44.0781854Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0781971Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0782112Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0782295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0782469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0782563Z ) 2025-05-07T20:33:44.0782848Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0782958Z def test_silu_mul_quant( 2025-05-07T20:33:44.0783054Z self, 2025-05-07T20:33:44.0783145Z T: int, 2025-05-07T20:33:44.0783234Z D: int, 2025-05-07T20:33:44.0783355Z scale_ub: Optional[float], 2025-05-07T20:33:44.0783459Z contiguous: bool, 2025-05-07T20:33:44.0783561Z compiled: bool, 2025-05-07T20:33:44.0783655Z ) -> None: 2025-05-07T20:33:44.0783771Z torch.manual_seed(2025) 2025-05-07T20:33:44.0783862Z 2025-05-07T20:33:44.0784060Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0784149Z 2025-05-07T20:33:44.0784261Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0784405Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0784518Z x = x_sign * x_clamp 2025-05-07T20:33:44.0784616Z x0 = x[:, :D] 2025-05-07T20:33:44.0784711Z x1 = x[:, D:] 2025-05-07T20:33:44.0784799Z 2025-05-07T20:33:44.0784903Z if contiguous: 2025-05-07T20:33:44.0785011Z x0 = x0.contiguous() 2025-05-07T20:33:44.0785118Z x1 = x1.contiguous() 2025-05-07T20:33:44.0785206Z 2025-05-07T20:33:44.0785317Z if scale_ub is not None: 2025-05-07T20:33:44.0785444Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0785602Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0785696Z ) 2025-05-07T20:33:44.0785792Z else: 2025-05-07T20:33:44.0785905Z scale_ub_tensor = None 2025-05-07T20:33:44.0785991Z 2025-05-07T20:33:44.0786144Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:33:44.0786251Z op = silu_mul_quant 2025-05-07T20:33:44.0786354Z if compiled: 2025-05-07T20:33:44.0786476Z op = torch.compile(op) 2025-05-07T20:33:44.0786599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0786686Z 2025-05-07T20:33:44.0786796Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0786938Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0787028Z 2025-05-07T20:33:44.0787187Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0787309Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0787429Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0787569Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0787788Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0787921Z 2025-05-07T20:33:44.0788041Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:44.0788046Z 2025-05-07T20:33:44.0788162Z moe/activation_test.py:126: 2025-05-07T20:33:44.0788317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0788445Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0788608Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0789244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0789363Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0789783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0790039Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0790470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0790765Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0791327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0791665Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0792093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0792288Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0792684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0792777Z fn() 2025-05-07T20:33:44.0793250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0793348Z self.fn.run( 2025-05-07T20:33:44.0793792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0793908Z kernel = self.compile( 2025-05-07T20:33:44.0794348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0794551Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0794701Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0794706Z 2025-05-07T20:33:44.0794943Z self = 2025-05-07T20:33:44.0795833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0796413Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28c24280>} 2025-05-07T20:33:44.0797261Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0797487Z context = 2025-05-07T20:33:44.0797493Z 2025-05-07T20:33:44.0797684Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0797990Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0798115Z module_map=module_map) 2025-05-07T20:33:44.0798305Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0798477Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0798612Z E ^ 2025-05-07T20:33:44.0799021Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0799026Z 2025-05-07T20:33:44.0799498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0799503Z 2025-05-07T20:33:44.0799626Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0799886Z self=, 2025-05-07T20:33:44.0799978Z T=4096, 2025-05-07T20:33:44.0800073Z D=5120, 2025-05-07T20:33:44.0800169Z scale_ub=None, 2025-05-07T20:33:44.0800269Z contiguous=True, 2025-05-07T20:33:44.0800370Z compiled=True, 2025-05-07T20:33:44.0800458Z ) 2025-05-07T20:33:44.0800706Z self = 2025-05-07T20:33:44.0800912Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0800917Z 2025-05-07T20:33:44.0801008Z @given( 2025-05-07T20:33:44.0801146Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0801267Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0801451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0801657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0801793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0801880Z ) 2025-05-07T20:33:44.0802167Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0802279Z def test_silu_mul_quant( 2025-05-07T20:33:44.0802370Z self, 2025-05-07T20:33:44.0802466Z T: int, 2025-05-07T20:33:44.0802558Z D: int, 2025-05-07T20:33:44.0802675Z scale_ub: Optional[float], 2025-05-07T20:33:44.0802784Z contiguous: bool, 2025-05-07T20:33:44.0802890Z compiled: bool, 2025-05-07T20:33:44.0802986Z ) -> None: 2025-05-07T20:33:44.0803102Z torch.manual_seed(2025) 2025-05-07T20:33:44.0803189Z 2025-05-07T20:33:44.0803387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0803479Z 2025-05-07T20:33:44.0803586Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0803738Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0803844Z x = x_sign * x_clamp 2025-05-07T20:33:44.0803939Z x0 = x[:, :D] 2025-05-07T20:33:44.0804039Z x1 = x[:, D:] 2025-05-07T20:33:44.0804126Z 2025-05-07T20:33:44.0804225Z if contiguous: 2025-05-07T20:33:44.0804340Z x0 = x0.contiguous() 2025-05-07T20:33:44.0804445Z x1 = x1.contiguous() 2025-05-07T20:33:44.0804531Z 2025-05-07T20:33:44.0804641Z if scale_ub is not None: 2025-05-07T20:33:44.0804765Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0804930Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0805025Z ) 2025-05-07T20:33:44.0805116Z else: 2025-05-07T20:33:44.0805228Z scale_ub_tensor 
= None 2025-05-07T20:33:44.0805314Z 2025-05-07T20:33:44.0805465Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0805577Z op = silu_mul_quant 2025-05-07T20:33:44.0805679Z if compiled: 2025-05-07T20:33:44.0805800Z op = torch.compile(op) 2025-05-07T20:33:44.0805930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0806018Z 2025-05-07T20:33:44.0806123Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0806268Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0806360Z 2025-05-07T20:33:44.0806519Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0806645Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0806816Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0807006Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0807172Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0807259Z 2025-05-07T20:33:44.0807381Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:44.0807388Z 2025-05-07T20:33:44.0807504Z moe/activation_test.py:126: 2025-05-07T20:33:44.0807658Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0807785Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0807943Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0808589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0808708Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0809121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0809385Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0809804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0810098Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0810644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0810934Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0811369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0811564Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0811956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0812053Z fn() 2025-05-07T20:33:44.0812514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0812616Z self.fn.run( 2025-05-07T20:33:44.0813005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0813126Z kernel = self.compile( 2025-05-07T20:33:44.0813564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0813767Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0813914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0813922Z 2025-05-07T20:33:44.0814158Z self = 2025-05-07T20:33:44.0815042Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0815622Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28c252d0>} 2025-05-07T20:33:44.0816471Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0816697Z context = 2025-05-07T20:33:44.0816702Z 2025-05-07T20:33:44.0816894Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0817201Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0817388Z module_map=module_map) 2025-05-07T20:33:44.0817616Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0817736Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0817834Z E ^ 2025-05-07T20:33:44.0818239Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0818249Z 2025-05-07T20:33:44.0818720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0818725Z 2025-05-07T20:33:44.0818849Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0819103Z self=, 2025-05-07T20:33:44.0819199Z T=16384, 2025-05-07T20:33:44.0819291Z D=5120, 2025-05-07T20:33:44.0819398Z scale_ub=None, 2025-05-07T20:33:44.0819498Z contiguous=True, 2025-05-07T20:33:44.0819596Z compiled=True, 2025-05-07T20:33:44.0819690Z ) 2025-05-07T20:33:44.0819943Z self = 2025-05-07T20:33:44.0820145Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0820150Z 2025-05-07T20:33:44.0820249Z @given( 2025-05-07T20:33:44.0820436Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0820592Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0820732Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0820879Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0821038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0821146Z ) 2025-05-07T20:33:44.0821427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0821540Z def test_silu_mul_quant( 2025-05-07T20:33:44.0821632Z self, 2025-05-07T20:33:44.0821723Z T: int, 2025-05-07T20:33:44.0821822Z D: int, 2025-05-07T20:33:44.0821940Z scale_ub: Optional[float], 2025-05-07T20:33:44.0822047Z contiguous: bool, 2025-05-07T20:33:44.0822152Z compiled: bool, 2025-05-07T20:33:44.0822245Z ) -> None: 2025-05-07T20:33:44.0822358Z torch.manual_seed(2025) 2025-05-07T20:33:44.0822449Z 2025-05-07T20:33:44.0822652Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0822739Z 2025-05-07T20:33:44.0822854Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0822998Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0823105Z x = x_sign * x_clamp 2025-05-07T20:33:44.0823199Z x0 = x[:, :D] 2025-05-07T20:33:44.0823292Z x1 = x[:, D:] 2025-05-07T20:33:44.0823385Z 2025-05-07T20:33:44.0823484Z if contiguous: 2025-05-07T20:33:44.0823594Z x0 = x0.contiguous() 2025-05-07T20:33:44.0823706Z x1 = x1.contiguous() 2025-05-07T20:33:44.0824011Z 2025-05-07T20:33:44.0824177Z if scale_ub is not None: 2025-05-07T20:33:44.0824354Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0824515Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:33:44.0824607Z ) 2025-05-07T20:33:44.0824701Z else: 2025-05-07T20:33:44.0824812Z scale_ub_tensor = None 2025-05-07T20:33:44.0824906Z 2025-05-07T20:33:44.0825063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0825171Z op = silu_mul_quant 2025-05-07T20:33:44.0825276Z if compiled: 2025-05-07T20:33:44.0825392Z op = torch.compile(op) 2025-05-07T20:33:44.0825516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0825605Z 2025-05-07T20:33:44.0825713Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0825854Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0825945Z 2025-05-07T20:33:44.0826101Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0826313Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0826515Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0826660Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0826832Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0826920Z 2025-05-07T20:33:44.0827041Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:44.0827047Z 2025-05-07T20:33:44.0827166Z moe/activation_test.py:126: 2025-05-07T20:33:44.0827315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0827439Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0827605Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0828238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0828360Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0828775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0829034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0829458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0829879Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0830342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0830636Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0831065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0831262Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0831661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0831754Z fn() 2025-05-07T20:33:44.0832218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0832317Z self.fn.run( 2025-05-07T20:33:44.0832715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0832824Z kernel = self.compile( 2025-05-07T20:33:44.0833257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0833466Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0833736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
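Every failing example above has the same root cause: Triton's fp8e4nv type is FP8 E4M3, which the NVIDIA backend only emits for compute capability 8.9 and newer (Ada/Hopper). This linux.g5.4xlarge runner carries an A10G at compute capability 8.6, so the kernel aborts at IR-generation time rather than producing wrong numbers. A minimal guard, sketched under those assumptions (the helper supports_fp8e4nv and the guarded test name are hypothetical, not part of moe/activation_test.py), would skip the test on unsupported GPUs:

    # Hypothetical guard, not part of the test file: skip FP8 E4M3 tests on
    # GPUs older than SM 8.9, where Triton rejects the fp8e4nv dtype.
    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to CUDA FP8 E4M3; the NVIDIA backend only
        # code-generates it for compute capability >= (8, 9). The A10G on
        # this runner reports (8, 6), hence the CompilationError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @pytest.mark.skipif(not supports_fp8e4nv(), reason="FP8 E4M3 needs SM 8.9+")
    def test_silu_mul_quant_guarded() -> None:
        ...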
2025-05-07T20:33:44.0838670Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -- same test body; this time the forward call itself fails:
2025-05-07T20:33:44.0845231Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:44.0845235Z 
2025-05-07T20:33:44.0845350Z moe/activation_test.py:117: 
2025-05-07T20:33:44.0845593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:44.0845717Z moe/activation_test.py:115: in fn
2025-05-07T20:33:44.0845834Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:44.0846254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:44.0846370Z     return fn(*args, **kwargs)
2025-05-07T20:33:44.0846930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:44.0847049Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:44.0847456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:44.0847712Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:44.0848104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:44.0848217Z     kernel = self.compile(
2025-05-07T20:33:44.0848661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:44.0848862Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:44.0849097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:44.0849103Z 
2025-05-07T20:33:44.0849345Z self = <...>
2025-05-07T20:33:44.0850227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:44.0850830Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fae28222560>}
2025-05-07T20:33:44.0851702Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:33:44.0851925Z context = <...>
2025-05-07T20:33:44.0851939Z 
2025-05-07T20:33:44.0852132Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:44.0852435Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:44.0852563Z                            module_map=module_map)
2025-05-07T20:33:44.0852750Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:44.0852870Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:44.0852965Z E       ^
2025-05-07T20:33:44.0853367Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.0853376Z 
2025-05-07T20:33:44.0853851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
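Here the failure moves from the reference path's row-quantize kernel into silu_mul_quant's own fused kernel, _fbgemm_silu_mul_quant; both abort in make_ir for the same architectural reason. For orientation, the operation under test is small; a rough eager-mode sketch (an illustration assuming the usual rowwise recipe of scale = row_max / FP8_MAX clamped by scale_ub, not fbgemm_gpu's actual implementation) is:

    # Illustrative eager-mode equivalent of SiLU-mul + rowwise FP8 quantization.
    # silu_mul_quant_ref and the exact scaling recipe are assumptions made for
    # clarity here, not fbgemm_gpu code.
    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # y = SiLU(x0) * x1, computed in fp32 as in the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        # Per-row dequantization scale; the test dequantizes with
        # y_fp8.to(torch.float32) * y_scale[:, None].
        y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale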
2025-05-07T20:33:44.0853979Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -- same test body; ref_fn() fails identically in _kernel_quantize_fp8_row (CompilationError: fp8e4nv not supported).
2025-05-07T20:33:44.0873129Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -- same test body; fn() fails identically in _fbgemm_silu_mul_quant.
2025-05-07T20:33:44.0891870Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -- same test body; fn() fails identically in _fbgemm_silu_mul_quant.
2025-05-07T20:33:44.0907072Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- same test body; fn() fails identically in _fbgemm_silu_mul_quant.
2025-05-07T20:33:44.0921797Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -- same test body; fn() fails identically in _fbgemm_silu_mul_quant.
2025-05-07T20:33:44.0936888Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -- same test body; fn() fails identically in _fbgemm_silu_mul_quant.
2025-05-07T20:33:44.0951534Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- same test body; fn() fails identically in _fbgemm_silu_mul_quant.
2025-05-07T20:33:44.0967625Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- same test body; fn() fails below:
2025-05-07T20:33:44.0974243Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:44.0974248Z 
2025-05-07T20:33:44.0974371Z moe/activation_test.py:117: 
2025-05-07T20:33:44.0974522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:44.0974639Z moe/activation_test.py:115: in fn
2025-05-07T20:33:44.0974758Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:44.0975222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:44.0975379Z     return fn(*args, **kwargs)
2025-05-07T20:33:44.0975942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.0976055Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.0976469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0976724Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0977113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0977233Z kernel = self.compile( 2025-05-07T20:33:44.0977670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0977874Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0978028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0978033Z 2025-05-07T20:33:44.0978268Z self = 2025-05-07T20:33:44.0979146Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0979718Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae289a85e0>} 2025-05-07T20:33:44.0980572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0980798Z context = 2025-05-07T20:33:44.0980803Z 2025-05-07T20:33:44.0980996Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0981298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0981448Z module_map=module_map) 2025-05-07T20:33:44.0981645Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0981787Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.0981890Z E ^ 2025-05-07T20:33:44.0982314Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0982408Z 2025-05-07T20:33:44.0982881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0982886Z 2025-05-07T20:33:44.0983013Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0983276Z self=, 2025-05-07T20:33:44.0983368Z T=1, 2025-05-07T20:33:44.0983462Z D=7168, 2025-05-07T20:33:44.0983561Z scale_ub=None, 2025-05-07T20:33:44.0983664Z contiguous=False, 2025-05-07T20:33:44.0983766Z compiled=True, 2025-05-07T20:33:44.0983855Z ) 2025-05-07T20:33:44.0984106Z self = 2025-05-07T20:33:44.0984299Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:44.0984304Z 2025-05-07T20:33:44.0984397Z @given( 2025-05-07T20:33:44.0984542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0984662Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0984798Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0984941Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0985076Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0985239Z ) 2025-05-07T20:33:44.0985575Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0985689Z def test_silu_mul_quant( 2025-05-07T20:33:44.0985780Z self, 2025-05-07T20:33:44.0985873Z T: int, 2025-05-07T20:33:44.0985962Z D: int, 2025-05-07T20:33:44.0986084Z scale_ub: Optional[float], 2025-05-07T20:33:44.0986189Z contiguous: bool, 2025-05-07T20:33:44.0986289Z compiled: bool, 2025-05-07T20:33:44.0986384Z ) -> None: 2025-05-07T20:33:44.0986495Z torch.manual_seed(2025) 2025-05-07T20:33:44.0986581Z 2025-05-07T20:33:44.0986783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0986875Z 2025-05-07T20:33:44.0986984Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0987135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0987238Z x = x_sign * x_clamp 2025-05-07T20:33:44.0987336Z x0 = x[:, :D] 2025-05-07T20:33:44.0987435Z x1 = x[:, D:] 2025-05-07T20:33:44.0987522Z 2025-05-07T20:33:44.0987621Z if contiguous: 2025-05-07T20:33:44.0987732Z x0 = x0.contiguous() 2025-05-07T20:33:44.0987836Z x1 = x1.contiguous() 2025-05-07T20:33:44.0987926Z 2025-05-07T20:33:44.0988033Z if scale_ub is not None: 2025-05-07T20:33:44.0988157Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0988318Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0988409Z ) 2025-05-07T20:33:44.0988498Z else: 2025-05-07T20:33:44.0988618Z scale_ub_tensor = None 2025-05-07T20:33:44.0988703Z 2025-05-07T20:33:44.0988857Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0988967Z op = silu_mul_quant 2025-05-07T20:33:44.0989066Z if compiled: 2025-05-07T20:33:44.0989185Z op = torch.compile(op) 2025-05-07T20:33:44.0989314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0989403Z 2025-05-07T20:33:44.0989516Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0989657Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0989742Z 2025-05-07T20:33:44.0989905Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0990026Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0990144Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0990291Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0990454Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0990594Z 2025-05-07T20:33:44.0990757Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:44.0990762Z 2025-05-07T20:33:44.0990878Z moe/activation_test.py:126: 2025-05-07T20:33:44.0991031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0991158Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0991329Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0991985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0992103Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0992513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0992774Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0993191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0993491Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0993996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0994377Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0994809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0995001Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0995393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0995484Z fn() 2025-05-07T20:33:44.0995940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0996043Z self.fn.run( 2025-05-07T20:33:44.0996430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0996538Z kernel = self.compile( 2025-05-07T20:33:44.0996975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0997183Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0997335Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0997340Z 2025-05-07T20:33:44.0997576Z self = 2025-05-07T20:33:44.0998452Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0999037Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fae2837eef0>} 2025-05-07T20:33:44.0999879Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1000109Z context = 2025-05-07T20:33:44.1000114Z 2025-05-07T20:33:44.1000306Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1000611Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1000739Z module_map=module_map) 2025-05-07T20:33:44.1000925Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1001047Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.1001187Z E ^ 2025-05-07T20:33:44.1001633Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1001638Z 2025-05-07T20:33:44.1002111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
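The reference path fails in the same way as the fused kernel, but through one extra layer: triton_quantize_fp8_row (fp8_gemm.py:2370) launches _kernel_quantize_fp8_row, which is wrapped in Triton's Autotuner. On first launch the autotuner benchmarks every candidate config (autotuner.py:186, timings = {config: self._bench(...) for config in pruned_configs}), and each benchmark compiles the kernel, so the fp8e4nv error surfaces from inside do_bench rather than directly from jit.py. A toy sketch of that wrapping, with an illustrative kernel and configs that are not FBGEMM's:

    # Toy autotuned Triton kernel (illustrative, not FBGEMM's): every
    # triton.Config below is compiled and timed at the first launch, so a
    # CompilationError aborts the benchmarking loop exactly as seen above.
    import torch
    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK": 256}, num_warps=4),
            triton.Config({"BLOCK": 512}, num_warps=8),
        ],
        key=["n"],
    )
    @triton.jit
    def _toy_scale(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        # One program handles BLOCK contiguous elements.
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty_like(x)
    grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK"]),)
    _toy_scale[grid](x, y, x.numel())  # autotuning (and JIT compile) happen here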
Each remaining Hypothesis example fails on the identical test body with an equivalent traceback (compiled=True adds the torch/_dynamo/eval_frame.py:678 frame; compiled=False calls silu_mul_quant directly), ending in the same CompilationError from _fbgemm_silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80. Only the drawn parameters differ:
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError: type fp8e4nv not supported in this architecture
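All of these failures share one root cause: the kernels materialize values as Triton's fp8e4nv element type (FP8 E4M3, torch.float8_e4m3fn on the PyTorch side), and Triton only supports that dtype on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The error text itself shows this job's GPU is an older part: fp8e4b15 and fp8e5 are the only fp8 dtypes Triton exposes on SM 8.0/8.6 devices such as A100 or A10G. A minimal guard one could put in front of such tests, offered as an assumption about a possible fix rather than FBGEMM's actual test code:

    # Hedged sketch (assumption, not FBGEMM's code): skip FP8-E4M3 tests on
    # GPUs older than SM 8.9, where Triton cannot compile fp8e4nv kernels.
    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class Fp8TestBase(unittest.TestCase):
        def setUp(self) -> None:
            if not supports_fp8_e4m3():
                self.skipTest("Triton fp8e4nv requires SM 8.9+ (Ada/Hopper)")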
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: type fp8e4nv not supported in this architecture
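For readers skimming past the repeated listings: the test builds x of shape [T, 2*D] in bfloat16, clamps its magnitude into [0.01, 2.0], and splits it into halves x0 and x1. ref_fn then computes y = SiLU(x0) * x1 = x0 * sigmoid(x0) * x1 in fp32 and row-wise quantizes it to FP8, returning the FP8 payload plus one fp32 scale per row, so a row dequantizes as y_fp8.to(torch.float32) * y_scale[:, None]. A PyTorch-only sketch of that math, assuming torch.float8_e4m3fn is available; the names are illustrative and the exact clamping order inside triton_quantize_fp8_row is not asserted here:

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1 in fp32, matching ref_fn in the listing above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)   # per-row absmax
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # optional upper bound
        fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0 for E4M3
        y_scale = row_max / fp8_max                      # per-row dequant scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale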
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError: type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError: type fp8e4nv not supported in this architecture
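Two traceback shapes alternate through this log. With compiled=True the call chain passes through torch/_dynamo/eval_frame.py:678 before reaching activation.py:80; with compiled=False the test calls silu_mul_quant directly. Either way the launch _fbgemm_silu_mul_quant[grid](...) is what triggers Triton compilation, and ast_to_ttir rejects the kernel before any GPU code runs, so torch.compile neither causes nor avoids the failure. A minimal illustration of the two call paths (a toy op, not the FBGEMM one):

    import torch

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        return x0 * torch.sigmoid(x0) * x1

    eager = silu_mul
    compiled = torch.compile(silu_mul)  # adds the eval_frame.py wrapper frame

    x0, x1 = torch.randn(4, 8), torch.randn(4, 8)
    assert torch.allclose(eager(x0, x1), compiled(x0, x1), atol=1e-6)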
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1096197Z 2025-05-07T20:33:44.1096664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1096672Z 2025-05-07T20:33:44.1096798Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1097054Z self=, 2025-05-07T20:33:44.1097145Z T=16384, 2025-05-07T20:33:44.1097244Z D=7168, 2025-05-07T20:33:44.1097340Z scale_ub=None, 2025-05-07T20:33:44.1097441Z contiguous=True, 2025-05-07T20:33:44.1097542Z compiled=True, 2025-05-07T20:33:44.1097631Z ) 2025-05-07T20:33:44.1097883Z self = 2025-05-07T20:33:44.1098083Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.1098088Z 2025-05-07T20:33:44.1098180Z @given( 2025-05-07T20:33:44.1098320Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1098437Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1098574Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1098720Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1098903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1099032Z ) 2025-05-07T20:33:44.1099319Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1099428Z def test_silu_mul_quant( 2025-05-07T20:33:44.1099521Z self, 2025-05-07T20:33:44.1099616Z T: int, 2025-05-07T20:33:44.1099709Z D: int, 2025-05-07T20:33:44.1099828Z scale_ub: Optional[float], 2025-05-07T20:33:44.1099933Z contiguous: bool, 2025-05-07T20:33:44.1100033Z compiled: bool, 2025-05-07T20:33:44.1100128Z ) -> None: 2025-05-07T20:33:44.1100238Z torch.manual_seed(2025) 2025-05-07T20:33:44.1100323Z 2025-05-07T20:33:44.1100521Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1100607Z 2025-05-07T20:33:44.1100714Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1100862Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1100968Z x = x_sign * x_clamp 2025-05-07T20:33:44.1101064Z x0 = x[:, :D] 2025-05-07T20:33:44.1101163Z x1 = x[:, D:] 2025-05-07T20:33:44.1101250Z 2025-05-07T20:33:44.1101359Z if contiguous: 2025-05-07T20:33:44.1101481Z x0 = x0.contiguous() 2025-05-07T20:33:44.1101659Z x1 = x1.contiguous() 2025-05-07T20:33:44.1101748Z 2025-05-07T20:33:44.1101897Z if scale_ub is not None: 2025-05-07T20:33:44.1102023Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1102185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1102274Z ) 2025-05-07T20:33:44.1102362Z else: 2025-05-07T20:33:44.1102475Z scale_ub_tensor = None 2025-05-07T20:33:44.1102560Z 2025-05-07T20:33:44.1102709Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1102816Z op = silu_mul_quant 2025-05-07T20:33:44.1102916Z if compiled: 2025-05-07T20:33:44.1103039Z op = torch.compile(op) 2025-05-07T20:33:44.1103164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1103252Z 2025-05-07T20:33:44.1103362Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1103367Z 2025-05-07T20:33:44.1103480Z moe/activation_test.py:117: 2025-05-07T20:33:44.1103631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1103757Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1103874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1104294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1104404Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1104965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1105081Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1105492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1105752Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1106144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1106256Z kernel = self.compile( 2025-05-07T20:33:44.1106699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1106900Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1107048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1107052Z 2025-05-07T20:33:44.1107291Z self = 2025-05-07T20:33:44.1108240Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1108861Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07513760>} 2025-05-07T20:33:44.1109705Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1109928Z context = 2025-05-07T20:33:44.1109938Z 2025-05-07T20:33:44.1110129Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1110428Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1110555Z module_map=module_map) 2025-05-07T20:33:44.1110746Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1110866Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1110961Z E ^ 2025-05-07T20:33:44.1111366Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:44.1111929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:44.1111935Z 
2025-05-07T20:33:44.1112056Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:44.1112310Z     self=,
2025-05-07T20:33:44.1112407Z     T=4096,
2025-05-07T20:33:44.1112497Z     D=5120,
2025-05-07T20:33:44.1112593Z     scale_ub=None,
2025-05-07T20:33:44.1112698Z     contiguous=False,
2025-05-07T20:33:44.1112796Z     compiled=True,
2025-05-07T20:33:44.1112883Z )
2025-05-07T20:33:44.1113138Z self = 
2025-05-07T20:33:44.1113340Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:44.1113345Z 
2025-05-07T20:33:44.1113437Z     @given(
2025-05-07T20:33:44.1113621Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:44.1113740Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:44.1113879Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:44.1114016Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:44.1114149Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:44.1114242Z     )
2025-05-07T20:33:44.1114527Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:44.1114635Z     def test_silu_mul_quant(
2025-05-07T20:33:44.1114730Z         self,
2025-05-07T20:33:44.1114820Z         T: int,
2025-05-07T20:33:44.1114912Z         D: int,
2025-05-07T20:33:44.1115025Z         scale_ub: Optional[float],
2025-05-07T20:33:44.1115133Z         contiguous: bool,
2025-05-07T20:33:44.1115239Z         compiled: bool,
2025-05-07T20:33:44.1115330Z     ) -> None:
2025-05-07T20:33:44.1115440Z         torch.manual_seed(2025)
2025-05-07T20:33:44.1115528Z 
2025-05-07T20:33:44.1115721Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1115811Z 
2025-05-07T20:33:44.1115926Z         x_sign = torch.sign(x)
2025-05-07T20:33:44.1116070Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:44.1116173Z         x = x_sign * x_clamp
2025-05-07T20:33:44.1116269Z         x0 = x[:, :D]
2025-05-07T20:33:44.1116364Z         x1 = x[:, D:]
2025-05-07T20:33:44.1116454Z 
2025-05-07T20:33:44.1116553Z         if contiguous:
2025-05-07T20:33:44.1116661Z             x0 = x0.contiguous()
2025-05-07T20:33:44.1116769Z             x1 = x1.contiguous()
2025-05-07T20:33:44.1116853Z 
2025-05-07T20:33:44.1116960Z         if scale_ub is not None:
2025-05-07T20:33:44.1117140Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:44.1117337Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:44.1117428Z             )
2025-05-07T20:33:44.1117521Z         else:
2025-05-07T20:33:44.1117630Z             scale_ub_tensor = None
2025-05-07T20:33:44.1117717Z 
2025-05-07T20:33:44.1117873Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:44.1117981Z             op = silu_mul_quant
2025-05-07T20:33:44.1118081Z             if compiled:
2025-05-07T20:33:44.1118204Z                 op = torch.compile(op)
2025-05-07T20:33:44.1118326Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:44.1118413Z 
2025-05-07T20:33:44.1118518Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:44.1118523Z 
2025-05-07T20:33:44.1118635Z moe/activation_test.py:117: 
2025-05-07T20:33:44.1118789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:44.1118906Z moe/activation_test.py:115: in fn
2025-05-07T20:33:44.1119024Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:44.1119449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:44.1119558Z     return fn(*args, **kwargs)
2025-05-07T20:33:44.1120169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:44.1120326Z     _fbgemm_silu_mul_quant[grid](
[... Triton compile stack identical to the trace above: jit.py:330 <lambda> -> jit.py:623 run -> compiler.py:273 compile -> make_ir ...]
2025-05-07T20:33:44.1126247Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:44.1126367Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:44.1126458Z E       ^
2025-05-07T20:33:44.1126860Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1127340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
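Stripped of the Hypothesis harness, the failing call reduces to a few lines. A hypothetical standalone repro (assuming a CUDA build of fbgemm_gpu with the experimental gen_ai extras installed), handy for verifying a fix on sm_89+ hardware:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Mirrors the failing example above: T=4096, D=5120, scale_ub=None.
T, D = 4096, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D], x[:, D:]

# On sm_86 (A10G) this raises triton.compiler.errors.CompilationError because
# the kernel emits fp8e4nv; on sm_89+ it returns the FP8 tensor and its scale.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)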
The remaining Hypothesis examples fail identically: every sampled parameter combination reaches the same _fbgemm_silu_mul_quant launch and aborts with the same CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100. Only the sampled arguments differ (the duplicated traceback and test source are elided for each):

2025-05-07T20:33:44.1127629Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:44.1142762Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:44.1158101Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:44.1175992Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1190784Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:44.1206882Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:44.1221890Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:44.1237856Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:44.1253532Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:44.1268845Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:44.1284177Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1298808Z 2025-05-07T20:33:44.1299279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1299283Z 2025-05-07T20:33:44.1299412Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1299670Z self=, 2025-05-07T20:33:44.1299760Z T=2048, 2025-05-07T20:33:44.1299854Z D=5120, 2025-05-07T20:33:44.1299952Z scale_ub=None, 2025-05-07T20:33:44.1300053Z contiguous=False, 2025-05-07T20:33:44.1300156Z compiled=True, 2025-05-07T20:33:44.1300242Z ) 2025-05-07T20:33:44.1300495Z self = 2025-05-07T20:33:44.1300702Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:44.1300706Z 2025-05-07T20:33:44.1300799Z @given( 2025-05-07T20:33:44.1300938Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1301053Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1301187Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1301326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1301508Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1301595Z ) 2025-05-07T20:33:44.1301927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1302038Z def test_silu_mul_quant( 2025-05-07T20:33:44.1302129Z self, 2025-05-07T20:33:44.1302225Z T: int, 2025-05-07T20:33:44.1302315Z D: int, 2025-05-07T20:33:44.1302436Z scale_ub: Optional[float], 2025-05-07T20:33:44.1302542Z contiguous: bool, 2025-05-07T20:33:44.1302643Z compiled: bool, 2025-05-07T20:33:44.1302737Z ) -> None: 2025-05-07T20:33:44.1302847Z torch.manual_seed(2025) 2025-05-07T20:33:44.1302933Z 2025-05-07T20:33:44.1303130Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1303217Z 2025-05-07T20:33:44.1303323Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1303471Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1303575Z x = x_sign * x_clamp 2025-05-07T20:33:44.1303674Z x0 = x[:, :D] 2025-05-07T20:33:44.1303778Z x1 = x[:, D:] 2025-05-07T20:33:44.1303862Z 2025-05-07T20:33:44.1303961Z if contiguous: 2025-05-07T20:33:44.1304071Z x0 = x0.contiguous() 2025-05-07T20:33:44.1304176Z x1 = x1.contiguous() 2025-05-07T20:33:44.1304311Z 2025-05-07T20:33:44.1304468Z if scale_ub is not None: 2025-05-07T20:33:44.1304592Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1304754Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1304842Z ) 2025-05-07T20:33:44.1304932Z else: 2025-05-07T20:33:44.1305044Z scale_ub_tensor = None 2025-05-07T20:33:44.1305134Z 2025-05-07T20:33:44.1305285Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1305394Z op = silu_mul_quant 2025-05-07T20:33:44.1305494Z if compiled: 2025-05-07T20:33:44.1305610Z op = torch.compile(op) 2025-05-07T20:33:44.1305740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1305829Z 2025-05-07T20:33:44.1305940Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1305945Z 2025-05-07T20:33:44.1306059Z moe/activation_test.py:117: 2025-05-07T20:33:44.1306207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1306334Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1306451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1306871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1306984Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1307547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1307664Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1308075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1308337Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1308732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1308845Z kernel = self.compile( 2025-05-07T20:33:44.1309286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1309497Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1309644Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1309649Z 2025-05-07T20:33:44.1309887Z self = 2025-05-07T20:33:44.1310822Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1311493Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad074543a0>} 2025-05-07T20:33:44.1312348Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1312570Z context = 2025-05-07T20:33:44.1312576Z 2025-05-07T20:33:44.1312771Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1313074Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1316241Z module_map=module_map) 2025-05-07T20:33:44.1316467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1316597Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1316689Z E ^ 2025-05-07T20:33:44.1317174Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1317254Z 2025-05-07T20:33:44.1317788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1317793Z 2025-05-07T20:33:44.1317918Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1318180Z self=, 2025-05-07T20:33:44.1318271Z T=2048, 2025-05-07T20:33:44.1318363Z D=5120, 2025-05-07T20:33:44.1318464Z scale_ub=1200.0, 2025-05-07T20:33:44.1318566Z contiguous=False, 2025-05-07T20:33:44.1318664Z compiled=True, 2025-05-07T20:33:44.1318755Z ) 2025-05-07T20:33:44.1319006Z self = 2025-05-07T20:33:44.1319214Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:44.1319223Z 2025-05-07T20:33:44.1319315Z @given( 2025-05-07T20:33:44.1319454Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1319579Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1319722Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1319859Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1319997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1320085Z ) 2025-05-07T20:33:44.1320369Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1320483Z def test_silu_mul_quant( 2025-05-07T20:33:44.1320574Z self, 2025-05-07T20:33:44.1320664Z T: int, 2025-05-07T20:33:44.1320758Z D: int, 2025-05-07T20:33:44.1320875Z scale_ub: Optional[float], 2025-05-07T20:33:44.1320986Z contiguous: bool, 2025-05-07T20:33:44.1321087Z compiled: bool, 2025-05-07T20:33:44.1321183Z ) -> None: 2025-05-07T20:33:44.1321300Z torch.manual_seed(2025) 2025-05-07T20:33:44.1321386Z 2025-05-07T20:33:44.1321581Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1321674Z 2025-05-07T20:33:44.1321790Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1321939Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1322048Z x = x_sign * x_clamp 2025-05-07T20:33:44.1322142Z x0 = x[:, :D] 2025-05-07T20:33:44.1322236Z x1 = x[:, D:] 2025-05-07T20:33:44.1322326Z 2025-05-07T20:33:44.1322424Z if contiguous: 2025-05-07T20:33:44.1322538Z x0 = x0.contiguous() 2025-05-07T20:33:44.1322642Z x1 = x1.contiguous() 2025-05-07T20:33:44.1322731Z 2025-05-07T20:33:44.1322844Z if scale_ub is not None: 2025-05-07T20:33:44.1322968Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1323221Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1323316Z ) 2025-05-07T20:33:44.1323406Z else: 2025-05-07T20:33:44.1323518Z scale_ub_tensor = None 2025-05-07T20:33:44.1323608Z 2025-05-07T20:33:44.1323764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1324135Z op = silu_mul_quant 2025-05-07T20:33:44.1324273Z if compiled: 2025-05-07T20:33:44.1324392Z op = torch.compile(op) 2025-05-07T20:33:44.1324520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1324606Z 2025-05-07T20:33:44.1324714Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1324720Z 2025-05-07T20:33:44.1324837Z moe/activation_test.py:117: 2025-05-07T20:33:44.1324988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1325106Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1325230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1325655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1325765Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1326330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1326643Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1327059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1327317Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1327707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1327819Z kernel = self.compile( 2025-05-07T20:33:44.1328256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1328469Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1328618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1328623Z 2025-05-07T20:33:44.1328860Z self = 2025-05-07T20:33:44.1329758Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1330333Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07454820>} 2025-05-07T20:33:44.1331182Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1331414Z context = 2025-05-07T20:33:44.1331419Z 2025-05-07T20:33:44.1331610Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1331921Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1332047Z module_map=module_map) 2025-05-07T20:33:44.1332240Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1332355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1332446Z E ^ 2025-05-07T20:33:44.1332855Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1332860Z 2025-05-07T20:33:44.1333332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1333404Z 2025-05-07T20:33:44.1333529Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1333851Z self=, 2025-05-07T20:33:44.1333943Z T=4096, 2025-05-07T20:33:44.1334034Z D=5120, 2025-05-07T20:33:44.1334139Z scale_ub=1200.0, 2025-05-07T20:33:44.1334244Z contiguous=True, 2025-05-07T20:33:44.1334344Z compiled=True, 2025-05-07T20:33:44.1334433Z ) 2025-05-07T20:33:44.1334686Z self = 2025-05-07T20:33:44.1334886Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:44.1334891Z 2025-05-07T20:33:44.1334985Z @given( 2025-05-07T20:33:44.1335125Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1335247Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1335380Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1335521Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1335663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1335751Z ) 2025-05-07T20:33:44.1336034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1336145Z def test_silu_mul_quant( 2025-05-07T20:33:44.1336286Z self, 2025-05-07T20:33:44.1336377Z T: int, 2025-05-07T20:33:44.1336517Z D: int, 2025-05-07T20:33:44.1336635Z scale_ub: Optional[float], 2025-05-07T20:33:44.1336740Z contiguous: bool, 2025-05-07T20:33:44.1336843Z compiled: bool, 2025-05-07T20:33:44.1336936Z ) -> None: 2025-05-07T20:33:44.1337051Z torch.manual_seed(2025) 2025-05-07T20:33:44.1337137Z 2025-05-07T20:33:44.1337333Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1337422Z 2025-05-07T20:33:44.1337530Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1337675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1337785Z x = x_sign * x_clamp 2025-05-07T20:33:44.1337882Z x0 = x[:, :D] 2025-05-07T20:33:44.1337976Z x1 = x[:, D:] 2025-05-07T20:33:44.1338064Z 2025-05-07T20:33:44.1338162Z if contiguous: 2025-05-07T20:33:44.1338270Z x0 = x0.contiguous() 2025-05-07T20:33:44.1338384Z x1 = x1.contiguous() 2025-05-07T20:33:44.1338473Z 2025-05-07T20:33:44.1338580Z if scale_ub is not None: 2025-05-07T20:33:44.1338706Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1338861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1338955Z ) 2025-05-07T20:33:44.1339046Z else: 2025-05-07T20:33:44.1339157Z scale_ub_tensor = None 2025-05-07T20:33:44.1339245Z 2025-05-07T20:33:44.1339395Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1339500Z op = silu_mul_quant 2025-05-07T20:33:44.1339608Z if compiled: 2025-05-07T20:33:44.1339726Z op = torch.compile(op) 2025-05-07T20:33:44.1339853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1339941Z 2025-05-07T20:33:44.1340047Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1340052Z 2025-05-07T20:33:44.1340170Z moe/activation_test.py:117: 2025-05-07T20:33:44.1340326Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1340445Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1340565Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1340988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1341099Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1341689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1341805Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1342307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1342566Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1342956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1343073Z kernel = self.compile( 2025-05-07T20:33:44.1343508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1343708Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1343859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1343864Z 2025-05-07T20:33:44.1344098Z self = 2025-05-07T20:33:44.1344982Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1345556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07455360>} 2025-05-07T20:33:44.1346483Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1346705Z context = 2025-05-07T20:33:44.1346710Z 2025-05-07T20:33:44.1346901Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1347206Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1347330Z module_map=module_map) 2025-05-07T20:33:44.1347526Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1347643Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1347733Z E ^ 2025-05-07T20:33:44.1348138Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1348147Z 2025-05-07T20:33:44.1348617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1348622Z 2025-05-07T20:33:44.1348743Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1349001Z self=, 2025-05-07T20:33:44.1349092Z T=128, 2025-05-07T20:33:44.1349184Z D=5120, 2025-05-07T20:33:44.1349281Z scale_ub=1200.0, 2025-05-07T20:33:44.1349382Z contiguous=False, 2025-05-07T20:33:44.1349481Z compiled=True, 2025-05-07T20:33:44.1349570Z ) 2025-05-07T20:33:44.1349819Z self = 2025-05-07T20:33:44.1350022Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:44.1350027Z 2025-05-07T20:33:44.1350117Z @given( 2025-05-07T20:33:44.1350253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1350377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1350509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1350647Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1350779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1350865Z ) 2025-05-07T20:33:44.1351150Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1351258Z def test_silu_mul_quant( 2025-05-07T20:33:44.1351346Z self, 2025-05-07T20:33:44.1351439Z T: int, 2025-05-07T20:33:44.1351529Z D: int, 2025-05-07T20:33:44.1351710Z scale_ub: Optional[float], 2025-05-07T20:33:44.1351829Z contiguous: bool, 2025-05-07T20:33:44.1351985Z compiled: bool, 2025-05-07T20:33:44.1352091Z ) -> None: 2025-05-07T20:33:44.1352231Z torch.manual_seed(2025) 2025-05-07T20:33:44.1352318Z 2025-05-07T20:33:44.1352521Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1352613Z 2025-05-07T20:33:44.1352723Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1352871Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1352976Z x = x_sign * x_clamp 2025-05-07T20:33:44.1353069Z x0 = x[:, :D] 2025-05-07T20:33:44.1353168Z x1 = x[:, D:] 2025-05-07T20:33:44.1353254Z 2025-05-07T20:33:44.1353352Z if contiguous: 2025-05-07T20:33:44.1353462Z x0 = x0.contiguous() 2025-05-07T20:33:44.1353624Z x1 = x1.contiguous() 2025-05-07T20:33:44.1353710Z 2025-05-07T20:33:44.1353819Z if scale_ub is not None: 2025-05-07T20:33:44.1353945Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1354109Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1354198Z ) 2025-05-07T20:33:44.1354288Z else: 2025-05-07T20:33:44.1354403Z scale_ub_tensor = None 2025-05-07T20:33:44.1354542Z 2025-05-07T20:33:44.1354734Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1354846Z op = silu_mul_quant 2025-05-07T20:33:44.1354945Z if compiled: 2025-05-07T20:33:44.1355064Z op = torch.compile(op) 2025-05-07T20:33:44.1355191Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1355277Z 2025-05-07T20:33:44.1355384Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1355389Z 2025-05-07T20:33:44.1355504Z moe/activation_test.py:117: 2025-05-07T20:33:44.1355653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1355777Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1355897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1356317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1356429Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1357005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1357119Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1357535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1357793Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1358187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1358297Z kernel = self.compile( 2025-05-07T20:33:44.1358741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1358949Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1359096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1359105Z 2025-05-07T20:33:44.1359347Z self = 2025-05-07T20:33:44.1360236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1360813Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07456290>} 2025-05-07T20:33:44.1361708Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1361997Z context = 2025-05-07T20:33:44.1362001Z 2025-05-07T20:33:44.1362196Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1362505Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1362630Z module_map=module_map) 2025-05-07T20:33:44.1362821Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1362937Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1363030Z E ^ 2025-05-07T20:33:44.1363434Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1363439Z 2025-05-07T20:33:44.1363908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1363918Z 2025-05-07T20:33:44.1364043Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1364303Z self=, 2025-05-07T20:33:44.1364394Z T=16384, 2025-05-07T20:33:44.1364535Z D=7168, 2025-05-07T20:33:44.1364636Z scale_ub=1200.0, 2025-05-07T20:33:44.1364781Z contiguous=True, 2025-05-07T20:33:44.1364882Z compiled=True, 2025-05-07T20:33:44.1364970Z ) 2025-05-07T20:33:44.1365225Z self = 2025-05-07T20:33:44.1365426Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:44.1365431Z 2025-05-07T20:33:44.1365522Z @given( 2025-05-07T20:33:44.1365663Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1365779Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1365913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1366059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1366191Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1366282Z ) 2025-05-07T20:33:44.1366566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1366680Z def test_silu_mul_quant( 2025-05-07T20:33:44.1366775Z self, 2025-05-07T20:33:44.1366868Z T: int, 2025-05-07T20:33:44.1366958Z D: int, 2025-05-07T20:33:44.1367075Z scale_ub: Optional[float], 2025-05-07T20:33:44.1367181Z contiguous: bool, 2025-05-07T20:33:44.1367281Z compiled: bool, 2025-05-07T20:33:44.1367376Z ) -> None: 2025-05-07T20:33:44.1367486Z torch.manual_seed(2025) 2025-05-07T20:33:44.1367572Z 2025-05-07T20:33:44.1367770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1367859Z 2025-05-07T20:33:44.1367971Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1368122Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1368229Z x = x_sign * x_clamp 2025-05-07T20:33:44.1368326Z x0 = x[:, :D] 2025-05-07T20:33:44.1368420Z x1 = x[:, D:] 2025-05-07T20:33:44.1368506Z 2025-05-07T20:33:44.1368606Z if contiguous: 2025-05-07T20:33:44.1368716Z x0 = x0.contiguous() 2025-05-07T20:33:44.1368824Z x1 = x1.contiguous() 2025-05-07T20:33:44.1368912Z 2025-05-07T20:33:44.1369018Z if scale_ub is not None: 2025-05-07T20:33:44.1369140Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1369298Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1369387Z ) 2025-05-07T20:33:44.1369478Z else: 2025-05-07T20:33:44.1369594Z scale_ub_tensor = None 2025-05-07T20:33:44.1369680Z 2025-05-07T20:33:44.1369834Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1369992Z op = silu_mul_quant 2025-05-07T20:33:44.1370091Z if compiled: 2025-05-07T20:33:44.1370257Z op = torch.compile(op) 2025-05-07T20:33:44.1370405Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1370513Z 2025-05-07T20:33:44.1370649Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1370659Z 2025-05-07T20:33:44.1370800Z moe/activation_test.py:117: 2025-05-07T20:33:44.1370991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1371144Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1371289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1371819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1371954Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1372656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1372807Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1373266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1373525Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1374011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1374123Z kernel = self.compile( 2025-05-07T20:33:44.1374566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1374768Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1374915Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1374920Z 2025-05-07T20:33:44.1375160Z self = 2025-05-07T20:33:44.1376044Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1376632Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07456d40>} 2025-05-07T20:33:44.1377479Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1377707Z context = 2025-05-07T20:33:44.1377712Z 2025-05-07T20:33:44.1377904Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1378208Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1378344Z module_map=module_map) 2025-05-07T20:33:44.1378531Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1378646Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1378741Z E ^ 2025-05-07T20:33:44.1379152Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1379160Z 2025-05-07T20:33:44.1379637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1379642Z 2025-05-07T20:33:44.1379762Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1380017Z self=, 2025-05-07T20:33:44.1380113Z T=16384, 2025-05-07T20:33:44.1380206Z D=5120, 2025-05-07T20:33:44.1380305Z scale_ub=1200.0, 2025-05-07T20:33:44.1380410Z contiguous=True, 2025-05-07T20:33:44.1380559Z compiled=False, 2025-05-07T20:33:44.1380649Z ) 2025-05-07T20:33:44.1380944Z self = 2025-05-07T20:33:44.1381150Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:44.1381155Z 2025-05-07T20:33:44.1381252Z @given( 2025-05-07T20:33:44.1381393Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1381525Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1381682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1381835Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1381967Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1382056Z ) 2025-05-07T20:33:44.1382341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1382458Z def test_silu_mul_quant( 2025-05-07T20:33:44.1382549Z self, 2025-05-07T20:33:44.1382643Z T: int, 2025-05-07T20:33:44.1382736Z D: int, 2025-05-07T20:33:44.1382857Z scale_ub: Optional[float], 2025-05-07T20:33:44.1382961Z contiguous: bool, 2025-05-07T20:33:44.1383065Z compiled: bool, 2025-05-07T20:33:44.1383156Z ) -> None: 2025-05-07T20:33:44.1383267Z torch.manual_seed(2025) 2025-05-07T20:33:44.1383407Z 2025-05-07T20:33:44.1383642Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1383729Z 2025-05-07T20:33:44.1383841Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1383986Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1384093Z x = x_sign * x_clamp 2025-05-07T20:33:44.1384187Z x0 = x[:, :D] 2025-05-07T20:33:44.1384283Z x1 = x[:, D:] 2025-05-07T20:33:44.1384372Z 2025-05-07T20:33:44.1384470Z if contiguous: 2025-05-07T20:33:44.1384577Z x0 = x0.contiguous() 2025-05-07T20:33:44.1384690Z x1 = x1.contiguous() 2025-05-07T20:33:44.1384779Z 2025-05-07T20:33:44.1384886Z if scale_ub is not None: 2025-05-07T20:33:44.1385014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1385171Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1385261Z ) 2025-05-07T20:33:44.1385358Z else: 2025-05-07T20:33:44.1385470Z scale_ub_tensor = None 2025-05-07T20:33:44.1385558Z 2025-05-07T20:33:44.1385712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1385818Z op = silu_mul_quant 2025-05-07T20:33:44.1385920Z if compiled: 2025-05-07T20:33:44.1386038Z op = torch.compile(op) 2025-05-07T20:33:44.1386160Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1386248Z 2025-05-07T20:33:44.1386354Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1386359Z 2025-05-07T20:33:44.1386472Z moe/activation_test.py:117: 2025-05-07T20:33:44.1386624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1386746Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1386862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1387439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:44.1387557Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1387975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1388232Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1388625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1388737Z kernel = self.compile( 2025-05-07T20:33:44.1389179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1389451Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1389643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1389648Z 2025-05-07T20:33:44.1389886Z self = 2025-05-07T20:33:44.1390776Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1391350Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07457ac0>} 2025-05-07T20:33:44.1392198Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1392426Z context = 2025-05-07T20:33:44.1392431Z 2025-05-07T20:33:44.1392623Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1392931Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1393144Z module_map=module_map) 2025-05-07T20:33:44.1393340Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1393456Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1393594Z E ^ 2025-05-07T20:33:44.1394006Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1394011Z 2025-05-07T20:33:44.1394485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1394489Z 2025-05-07T20:33:44.1394617Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1394877Z self=, 2025-05-07T20:33:44.1394967Z T=1, 2025-05-07T20:33:44.1395062Z D=7168, 2025-05-07T20:33:44.1395160Z scale_ub=1200.0, 2025-05-07T20:33:44.1395261Z contiguous=False, 2025-05-07T20:33:44.1395369Z compiled=False, 2025-05-07T20:33:44.1395456Z ) 2025-05-07T20:33:44.1395710Z self = 2025-05-07T20:33:44.1395910Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:44.1395915Z 2025-05-07T20:33:44.1396007Z @given( 2025-05-07T20:33:44.1396144Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1396264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1396399Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1396538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1396677Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1396764Z ) 2025-05-07T20:33:44.1397057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1397168Z def test_silu_mul_quant( 2025-05-07T20:33:44.1397258Z self, 2025-05-07T20:33:44.1397353Z T: int, 2025-05-07T20:33:44.1397443Z D: int, 2025-05-07T20:33:44.1397561Z scale_ub: Optional[float], 2025-05-07T20:33:44.1397671Z contiguous: bool, 2025-05-07T20:33:44.1397773Z compiled: bool, 2025-05-07T20:33:44.1397869Z ) -> None: 2025-05-07T20:33:44.1397979Z torch.manual_seed(2025) 2025-05-07T20:33:44.1398066Z 2025-05-07T20:33:44.1398267Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1398355Z 2025-05-07T20:33:44.1398463Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1398611Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1398799Z x = x_sign * x_clamp 2025-05-07T20:33:44.1398893Z x0 = x[:, :D] 2025-05-07T20:33:44.1399034Z x1 = x[:, D:] 2025-05-07T20:33:44.1399121Z 2025-05-07T20:33:44.1399219Z if contiguous: 2025-05-07T20:33:44.1399331Z x0 = x0.contiguous() 2025-05-07T20:33:44.1399436Z x1 = x1.contiguous() 2025-05-07T20:33:44.1399523Z 2025-05-07T20:33:44.1399636Z if scale_ub is not None: 2025-05-07T20:33:44.1399760Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1399921Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1400011Z ) 2025-05-07T20:33:44.1400100Z else: 2025-05-07T20:33:44.1400215Z scale_ub_tensor = None 2025-05-07T20:33:44.1400301Z 2025-05-07T20:33:44.1400452Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1400561Z op = silu_mul_quant 2025-05-07T20:33:44.1400661Z if compiled: 2025-05-07T20:33:44.1400780Z op = torch.compile(op) 2025-05-07T20:33:44.1400911Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1400997Z 2025-05-07T20:33:44.1401103Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1401111Z 2025-05-07T20:33:44.1401224Z moe/activation_test.py:117: 2025-05-07T20:33:44.1401373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1401585Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1401704Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1402274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1402392Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1402802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1403065Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1403462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1403574Z kernel = self.compile( 2025-05-07T20:33:44.1404015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1404224Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1404371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1404376Z 2025-05-07T20:33:44.1404615Z self = 2025-05-07T20:33:44.1405497Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1406079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06fa8550>} 2025-05-07T20:33:44.1406928Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1407158Z context = 2025-05-07T20:33:44.1407163Z 2025-05-07T20:33:44.1407355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1407659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1407790Z module_map=module_map) 2025-05-07T20:33:44.1407978Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1408096Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1408190Z E ^ 2025-05-07T20:33:44.1408686Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1408692Z 2025-05-07T20:33:44.1409167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1409175Z 2025-05-07T20:33:44.1409298Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1409559Z self=, 2025-05-07T20:33:44.1409654Z T=4096, 2025-05-07T20:33:44.1409744Z D=7168, 2025-05-07T20:33:44.1409843Z scale_ub=1200.0, 2025-05-07T20:33:44.1409948Z contiguous=False, 2025-05-07T20:33:44.1410047Z compiled=True, 2025-05-07T20:33:44.1410136Z ) 2025-05-07T20:33:44.1410387Z self = 2025-05-07T20:33:44.1410589Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:44.1410594Z 2025-05-07T20:33:44.1410691Z @given( 2025-05-07T20:33:44.1410828Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1410948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1411085Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1411222Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1411431Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1411565Z ) 2025-05-07T20:33:44.1411901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1412015Z def test_silu_mul_quant( 2025-05-07T20:33:44.1412106Z self, 2025-05-07T20:33:44.1412198Z T: int, 2025-05-07T20:33:44.1412292Z D: int, 2025-05-07T20:33:44.1412407Z scale_ub: Optional[float], 2025-05-07T20:33:44.1412511Z contiguous: bool, 2025-05-07T20:33:44.1412615Z compiled: bool, 2025-05-07T20:33:44.1412707Z ) -> None: 2025-05-07T20:33:44.1412817Z torch.manual_seed(2025) 2025-05-07T20:33:44.1412908Z 2025-05-07T20:33:44.1413108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1413195Z 2025-05-07T20:33:44.1413305Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1413450Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1413562Z x = x_sign * x_clamp 2025-05-07T20:33:44.1413658Z x0 = x[:, :D] 2025-05-07T20:33:44.1413754Z x1 = x[:, D:] 2025-05-07T20:33:44.1413843Z 2025-05-07T20:33:44.1413941Z if contiguous: 2025-05-07T20:33:44.1414048Z x0 = x0.contiguous() 2025-05-07T20:33:44.1414156Z x1 = x1.contiguous() 2025-05-07T20:33:44.1414242Z 2025-05-07T20:33:44.1414349Z if scale_ub is not None: 2025-05-07T20:33:44.1414477Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1414632Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1414721Z ) 2025-05-07T20:33:44.1414816Z else: 2025-05-07T20:33:44.1414928Z scale_ub_tensor = None 2025-05-07T20:33:44.1415020Z 2025-05-07T20:33:44.1415171Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1415277Z op = silu_mul_quant 2025-05-07T20:33:44.1415379Z if compiled: 2025-05-07T20:33:44.1415498Z op = torch.compile(op) 2025-05-07T20:33:44.1415624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1415712Z 2025-05-07T20:33:44.1415820Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1415825Z 2025-05-07T20:33:44.1415939Z moe/activation_test.py:117: 2025-05-07T20:33:44.1416091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1416208Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1416325Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1416746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1416910Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1417516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1417633Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1418045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1418310Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1418702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1418814Z kernel = self.compile( 2025-05-07T20:33:44.1419252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1419457Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1419605Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1419614Z 2025-05-07T20:33:44.1419853Z self = 2025-05-07T20:33:44.1420853Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1421502Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06fa8f70>} 2025-05-07T20:33:44.1422342Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1422567Z context = 2025-05-07T20:33:44.1422575Z 2025-05-07T20:33:44.1422767Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1423071Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1423196Z module_map=module_map) 2025-05-07T20:33:44.1423385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1423506Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1423596Z E ^ 2025-05-07T20:33:44.1424395Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1424407Z 2025-05-07T20:33:44.1424882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1424887Z 2025-05-07T20:33:44.1425008Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1425265Z self=, 2025-05-07T20:33:44.1425360Z T=128, 2025-05-07T20:33:44.1425453Z D=7168, 2025-05-07T20:33:44.1425554Z scale_ub=1200.0, 2025-05-07T20:33:44.1425655Z contiguous=False, 2025-05-07T20:33:44.1425752Z compiled=True, 2025-05-07T20:33:44.1425841Z ) 2025-05-07T20:33:44.1426090Z self = 2025-05-07T20:33:44.1426297Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:44.1426302Z 2025-05-07T20:33:44.1426392Z @given( 2025-05-07T20:33:44.1426530Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1426648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1426781Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1426916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1427053Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1427140Z ) 2025-05-07T20:33:44.1427516Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1427699Z def test_silu_mul_quant( 2025-05-07T20:33:44.1427790Z self, 2025-05-07T20:33:44.1427884Z T: int, 2025-05-07T20:33:44.1427973Z D: int, 2025-05-07T20:33:44.1428087Z scale_ub: Optional[float], 2025-05-07T20:33:44.1428197Z contiguous: bool, 2025-05-07T20:33:44.1428299Z compiled: bool, 2025-05-07T20:33:44.1428390Z ) -> None: 2025-05-07T20:33:44.1428503Z torch.manual_seed(2025) 2025-05-07T20:33:44.1428590Z 2025-05-07T20:33:44.1428785Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1428876Z 2025-05-07T20:33:44.1428984Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1429129Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1429236Z x = x_sign * x_clamp 2025-05-07T20:33:44.1429330Z x0 = x[:, :D] 2025-05-07T20:33:44.1429427Z x1 = x[:, D:] 2025-05-07T20:33:44.1429517Z 2025-05-07T20:33:44.1429613Z if contiguous: 2025-05-07T20:33:44.1429728Z x0 = x0.contiguous() 2025-05-07T20:33:44.1429831Z x1 = x1.contiguous() 2025-05-07T20:33:44.1429918Z 2025-05-07T20:33:44.1430025Z if scale_ub is not None: 2025-05-07T20:33:44.1430222Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1430460Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1430553Z ) 2025-05-07T20:33:44.1430642Z else: 2025-05-07T20:33:44.1430751Z scale_ub_tensor = None 2025-05-07T20:33:44.1430841Z 2025-05-07T20:33:44.1431014Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1431146Z op = silu_mul_quant 2025-05-07T20:33:44.1431245Z if compiled: 2025-05-07T20:33:44.1431360Z op = torch.compile(op) 2025-05-07T20:33:44.1431484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1431572Z 2025-05-07T20:33:44.1431677Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1431682Z 2025-05-07T20:33:44.1431801Z moe/activation_test.py:117: 2025-05-07T20:33:44.1431949Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1432066Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1432190Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1432608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1432719Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1433277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:44.1433406Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:44.1434031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:44.1434384Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:44.1434924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:44.1435080Z     kernel = self.compile(
2025-05-07T20:33:44.1435568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:44.1435781Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:44.1435930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:44.1436171Z self =
2025-05-07T20:33:44.1437050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:44.1437743Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06fa92d0>}
2025-05-07T20:33:44.1438592Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:44.1438815Z context =
2025-05-07T20:33:44.1439014Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:44.1439318Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:44.1439442Z                            module_map=module_map)
2025-05-07T20:33:44.1439630Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:44.1439746Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:44.1439840Z E   ^
2025-05-07T20:33:44.1440248Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1440721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:44.1440937Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
    [... test body and Triton traceback repeat verbatim for this example ...]
2025-05-07T20:33:44.1455333Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1455805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
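Every example that reaches the Triton kernel launch dies in the same place: the NVIDIA backend rejects the fp8e4nv (FP8 E4M3) type and offers only fp8e4b15 and fp8e5, which indicates this runner's GPU architecture simply lacks that dtype. A minimal sketch of a guard that could skip these cases up front; the 8.9 capability threshold is an assumption inferred from the error text, not something this log confirms:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (E4M3) is only available on compute
        # capability >= 8.9 (Ada/Hopper). The log only shows that the
        # current GPU is NOT supported.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical placement: decorate the FP8 tests so unsupported runners
    # skip once instead of erroring through every Hypothesis example.
    skip_unless_fp8 = unittest.skipUnless(
        supports_fp8e4nv(), "fp8e4nv unsupported on this GPU"
    )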
2025-05-07T20:33:44.1455935Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
    [... test body repeats verbatim; this example fails before reaching the kernel ...]
2025-05-07T20:33:44.1463717Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:44.1465769Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:44.1465919Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:44.1466049Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:44.1470113Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:44.1472175Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (28.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1472322Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:44.1472538Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:44.1476305Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1478318Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (140.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1478463Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1478590Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:44.1482740Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:44.1484750Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1484904Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:44.1485029Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1488938Z >       x_sign = torch.sign(x)
2025-05-07T20:33:44.1490940Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1491095Z moe/activation_test.py:94: OutOfMemoryError
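The requested sizes in these OOMs track the test's own shapes exactly: x is [T, 2*D] in bfloat16 (2 bytes per element), and x_sign / x_clamp each materialize another tensor of the same size, so every failing line asks for T * 2D * 2 bytes. A quick check against the figures reported above:

    # Size of one [T, 2*D] bfloat16 tensor, in MiB.
    def alloc_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20

    assert alloc_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB"
    assert alloc_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
    assert alloc_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"
    assert alloc_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"

So no single allocation is oversized; the problem is the ~21-22 GiB already held on the device when each example starts.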
2025-05-07T20:33:44.1491220Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
    [... test body runs through y_fp8, y_scale = fn(); same Triton traceback as above ...]
2025-05-07T20:33:44.1505758Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1506243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:44.1506372Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
    [... same traceback ...]
2025-05-07T20:33:44.1520694Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1521168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:44.1521292Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
    [... same traceback ...]
2025-05-07T20:33:44.1535728Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1536289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
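Note that even T=1 fails identically: the error is raised while compiling _fbgemm_silu_mul_quant, before any data-dependent work, so neither the tensor shape nor the compiled flag matters. An untested sketch of about the smallest Triton program that should reproduce the dtype rejection on this class of GPU; the kernel and tensor names are illustrative, not from FBGEMM:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _fp8_probe(x_ptr, y_ptr):
        # The cast to fp8e4nv is what the NVIDIA backend rejects on
        # architectures without E4M3 support.
        x = tl.load(x_ptr)
        tl.store(y_ptr, x.to(tl.float8e4nv))

    x = torch.randn(1, device="cuda")
    y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8_probe[(1,)](x, y)  # expected: the same CompilationError as above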
2025-05-07T20:33:44.1536418Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:44.1540120Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1542119Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1542270Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1542394Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    [... test body runs through y_fp8, y_scale = fn(); same Triton traceback as above ...]
2025-05-07T20:33:44.1556373Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1556841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:44.1556979Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1560854Z >       x_sign = torch.sign(x)
2025-05-07T20:33:44.1562840Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1562991Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:33:44.1563116Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1566912Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1568896Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1569041Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1569167Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1572896Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1575012Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1575166Z moe/activation_test.py:92: OutOfMemoryError
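From here on the allocator reports ~22.04 GiB in use with only ~26 MiB free before each example even allocates its input, so memory is accumulating across Hypothesis examples rather than any single case being too large. The error text itself suggests expandable segments; a sketch of that plus an explicit cache release between examples (the env var must be set before CUDA is first initialized, and release_cuda_memory is a hypothetical helper, not FBGEMM API):

    import os

    # Allocator hint taken from the OOM message; it has to be in the
    # environment before torch initializes CUDA.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    def release_cuda_memory() -> None:
        # Return cached blocks to the driver so the next Hypothesis
        # example starts from a clean pool.
        torch.cuda.synchronize()
        torch.cuda.empty_cache()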
2025-05-07T20:33:44.1575291Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:44.1578982Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1580968Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1581113Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1581245Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:44.1584948Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1586960Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1587147Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1587270Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:44.1590847Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1592871Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1593057Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1593187Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:44.1596898Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1598876Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1599029Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1599154Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:44.1606181Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1608190Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1608427Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1608554Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1612207Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1614203Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1614347Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1614566Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1618238Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1620232Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.1620244Z 2025-05-07T20:33:44.1620383Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:44.1620388Z 2025-05-07T20:33:44.1620506Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1620756Z self=, 2025-05-07T20:33:44.1620852Z T=16384, 2025-05-07T20:33:44.1620943Z D=7168, 2025-05-07T20:33:44.1621041Z scale_ub=1200.0, 2025-05-07T20:33:44.1621145Z contiguous=True, 2025-05-07T20:33:44.1621244Z compiled=False, 2025-05-07T20:33:44.1621335Z ) 2025-05-07T20:33:44.1621616Z self = 2025-05-07T20:33:44.1621842Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:44.1621846Z 2025-05-07T20:33:44.1621949Z @given( 2025-05-07T20:33:44.1622098Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1622220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1622354Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1622491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1622623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1622709Z ) 2025-05-07T20:33:44.1622993Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1623104Z def test_silu_mul_quant( 2025-05-07T20:33:44.1623202Z self, 2025-05-07T20:33:44.1623293Z T: int, 2025-05-07T20:33:44.1623383Z D: int, 2025-05-07T20:33:44.1623556Z scale_ub: Optional[float], 2025-05-07T20:33:44.1623704Z contiguous: bool, 2025-05-07T20:33:44.1624147Z compiled: bool, 2025-05-07T20:33:44.1624293Z ) -> None: 2025-05-07T20:33:44.1624413Z torch.manual_seed(2025) 2025-05-07T20:33:44.1624499Z 2025-05-07T20:33:44.1624706Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1626707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
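The five examples above all abort at that first randn call even though the requested blocks are small relative to the card, while each message reports about 21.7 GiB already held by PyTorch. That pattern suggests tensors from earlier Hypothesis examples are still alive when the next example starts. One common mitigation, sketched here as an assumption rather than anything this suite is shown to do, is to release cached CUDA memory between examples:

    import gc

    import torch

    def free_cuda_between_examples() -> None:
        # Drop dangling Python references that keep CUDA tensors alive,
        # then return the allocator's cached blocks to the driver.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

Calling this at the top of test_silu_mul_quant, or from a pytest fixture, would keep one example's tensors from starving the next. Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which the error text itself recommends, targets fragmentation rather than total usage, so it may not help if the memory is genuinely still referenced.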
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.1626716Z 2025-05-07T20:33:44.1626857Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:44.1626865Z 2025-05-07T20:33:44.1626984Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1627240Z self=, 2025-05-07T20:33:44.1627436Z T=128, 2025-05-07T20:33:44.1627527Z D=5120, 2025-05-07T20:33:44.1627692Z scale_ub=1200.0, 2025-05-07T20:33:44.1627795Z contiguous=False, 2025-05-07T20:33:44.1627893Z compiled=False, 2025-05-07T20:33:44.1627981Z ) 2025-05-07T20:33:44.1628231Z self = 2025-05-07T20:33:44.1628427Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:44.1628432Z 2025-05-07T20:33:44.1628527Z @given( 2025-05-07T20:33:44.1628661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1628779Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1628914Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1629053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1629192Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1629280Z ) 2025-05-07T20:33:44.1629560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1629678Z def test_silu_mul_quant( 2025-05-07T20:33:44.1629770Z self, 2025-05-07T20:33:44.1629860Z T: int, 2025-05-07T20:33:44.1629954Z D: int, 2025-05-07T20:33:44.1630067Z scale_ub: Optional[float], 2025-05-07T20:33:44.1630171Z contiguous: bool, 2025-05-07T20:33:44.1630275Z compiled: bool, 2025-05-07T20:33:44.1630367Z ) -> None: 2025-05-07T20:33:44.1630482Z torch.manual_seed(2025) 2025-05-07T20:33:44.1630569Z 2025-05-07T20:33:44.1630786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1630888Z 2025-05-07T20:33:44.1631015Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1631163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1631271Z x = x_sign * x_clamp 2025-05-07T20:33:44.1631365Z x0 = x[:, :D] 2025-05-07T20:33:44.1631460Z x1 = x[:, D:] 2025-05-07T20:33:44.1631549Z 2025-05-07T20:33:44.1631651Z if contiguous: 2025-05-07T20:33:44.1631762Z x0 = x0.contiguous() 2025-05-07T20:33:44.1631872Z x1 = x1.contiguous() 2025-05-07T20:33:44.1631957Z 2025-05-07T20:33:44.1632067Z if scale_ub is not None: 2025-05-07T20:33:44.1632191Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1632346Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1632441Z ) 2025-05-07T20:33:44.1632531Z else: 2025-05-07T20:33:44.1632641Z scale_ub_tensor = None 2025-05-07T20:33:44.1632730Z 2025-05-07T20:33:44.1632881Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1633062Z op = silu_mul_quant 2025-05-07T20:33:44.1633227Z if compiled: 2025-05-07T20:33:44.1633346Z op = torch.compile(op) 2025-05-07T20:33:44.1633468Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1633612Z 2025-05-07T20:33:44.1633719Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1633727Z 2025-05-07T20:33:44.1633847Z moe/activation_test.py:117: 2025-05-07T20:33:44.1633998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1634116Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1634235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1634808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1634921Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1635338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:44.1635600Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1635996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1636106Z kernel = self.compile( 2025-05-07T20:33:44.1636642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1636848Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1636996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1637000Z 2025-05-07T20:33:44.1637241Z self = 2025-05-07T20:33:44.1638125Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1638706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06c11ea0>} 2025-05-07T20:33:44.1639555Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1639778Z context = 2025-05-07T20:33:44.1639783Z 2025-05-07T20:33:44.1639975Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1640276Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1640402Z module_map=module_map) 2025-05-07T20:33:44.1640593Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1640712Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1640806Z E ^ 2025-05-07T20:33:44.1641213Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1641219Z 2025-05-07T20:33:44.1641695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1641700Z 2025-05-07T20:33:44.1641827Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1642081Z self=, 2025-05-07T20:33:44.1642172Z T=2048, 2025-05-07T20:33:44.1642266Z D=7168, 2025-05-07T20:33:44.1642362Z scale_ub=None, 2025-05-07T20:33:44.1642469Z contiguous=False, 2025-05-07T20:33:44.1642567Z compiled=False, 2025-05-07T20:33:44.1642653Z ) 2025-05-07T20:33:44.1642905Z self = 2025-05-07T20:33:44.1643184Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:44.1643232Z 2025-05-07T20:33:44.1643322Z @given( 2025-05-07T20:33:44.1643462Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1643579Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1643716Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1643858Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1643990Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1644079Z ) 2025-05-07T20:33:44.1644360Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1644469Z def test_silu_mul_quant( 2025-05-07T20:33:44.1644561Z self, 2025-05-07T20:33:44.1644652Z T: int, 2025-05-07T20:33:44.1644742Z D: int, 2025-05-07T20:33:44.1644858Z scale_ub: Optional[float], 2025-05-07T20:33:44.1644962Z contiguous: bool, 2025-05-07T20:33:44.1645066Z compiled: bool, 2025-05-07T20:33:44.1645162Z ) -> None: 2025-05-07T20:33:44.1645276Z torch.manual_seed(2025) 2025-05-07T20:33:44.1645361Z 2025-05-07T20:33:44.1645560Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1647594Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
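On the CompilationError above (it recurs below): fp8e4nv is Triton's name for the FP8 E4M3 dtype, and the error shows this GPU's backend offering only fp8e4b15 and fp8e5. A hedged guard, assuming fp8e4nv needs roughly compute capability 8.9 or newer (the helper below is hypothetical; the real suite may gate this differently):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton exposes fp8e4nv only on devices with compute
        # capability >= (8, 9); this log's device reports fp8e4b15/fp8e5 only.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8e4nv = unittest.skipIf(
        not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU"
    )
    # Decorating test_silu_mul_quant with @requires_fp8e4nv would turn the
    # CompilationError above into a skip rather than a failure on this runner.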
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.1647643Z 2025-05-07T20:33:44.1647788Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:44.1647795Z 2025-05-07T20:33:44.1647915Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1648175Z self=, 2025-05-07T20:33:44.1648266Z T=128, 2025-05-07T20:33:44.1648355Z D=7168, 2025-05-07T20:33:44.1648457Z scale_ub=1200.0, 2025-05-07T20:33:44.1648559Z contiguous=True, 2025-05-07T20:33:44.1648657Z compiled=True, 2025-05-07T20:33:44.1648751Z ) 2025-05-07T20:33:44.1648997Z self = 2025-05-07T20:33:44.1649189Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:44.1649194Z 2025-05-07T20:33:44.1649289Z @given( 2025-05-07T20:33:44.1649424Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1649543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1649676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1649810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1649950Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1650040Z ) 2025-05-07T20:33:44.1650343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1650485Z def test_silu_mul_quant( 2025-05-07T20:33:44.1650596Z self, 2025-05-07T20:33:44.1650712Z T: int, 2025-05-07T20:33:44.1650830Z D: int, 2025-05-07T20:33:44.1650973Z scale_ub: Optional[float], 2025-05-07T20:33:44.1651107Z contiguous: bool, 2025-05-07T20:33:44.1651232Z compiled: bool, 2025-05-07T20:33:44.1651346Z ) -> None: 2025-05-07T20:33:44.1651486Z torch.manual_seed(2025) 2025-05-07T20:33:44.1651593Z 2025-05-07T20:33:44.1651831Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1651942Z 2025-05-07T20:33:44.1652075Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1652224Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1652388Z x = x_sign * x_clamp 2025-05-07T20:33:44.1652482Z x0 = x[:, :D] 2025-05-07T20:33:44.1652618Z x1 = x[:, D:] 2025-05-07T20:33:44.1652707Z 2025-05-07T20:33:44.1652805Z if contiguous: 2025-05-07T20:33:44.1652913Z x0 = x0.contiguous() 2025-05-07T20:33:44.1653024Z x1 = x1.contiguous() 2025-05-07T20:33:44.1653108Z 2025-05-07T20:33:44.1653219Z if scale_ub is not None: 2025-05-07T20:33:44.1653340Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1653494Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1653585Z ) 2025-05-07T20:33:44.1653674Z else: 2025-05-07T20:33:44.1653783Z scale_ub_tensor = None 2025-05-07T20:33:44.1653871Z 2025-05-07T20:33:44.1654019Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1654125Z op = silu_mul_quant 2025-05-07T20:33:44.1654226Z if compiled: 2025-05-07T20:33:44.1654345Z op = torch.compile(op) 2025-05-07T20:33:44.1654470Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1654558Z 2025-05-07T20:33:44.1654663Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1654668Z 2025-05-07T20:33:44.1654783Z moe/activation_test.py:117: 2025-05-07T20:33:44.1654978Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1655135Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1655255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1655673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1655782Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1656340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1656454Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1656943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:44.1657314Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1657842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1658001Z kernel = self.compile( 2025-05-07T20:33:44.1658492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1658695Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1658847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1658852Z 2025-05-07T20:33:44.1659084Z self = 2025-05-07T20:33:44.1659959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1660529Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06c137f0>} 2025-05-07T20:33:44.1661388Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1661645Z context = 2025-05-07T20:33:44.1661650Z 2025-05-07T20:33:44.1661843Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1662147Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1662273Z module_map=module_map) 2025-05-07T20:33:44.1662527Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1662688Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1662779Z E ^ 2025-05-07T20:33:44.1663183Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1663191Z 2025-05-07T20:33:44.1663657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1663662Z 2025-05-07T20:33:44.1663788Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1664040Z self=, 2025-05-07T20:33:44.1664131Z T=128, 2025-05-07T20:33:44.1664225Z D=7168, 2025-05-07T20:33:44.1664322Z scale_ub=1200.0, 2025-05-07T20:33:44.1664421Z contiguous=True, 2025-05-07T20:33:44.1664520Z compiled=False, 2025-05-07T20:33:44.1664605Z ) 2025-05-07T20:33:44.1664856Z self = 2025-05-07T20:33:44.1665056Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:44.1665062Z 2025-05-07T20:33:44.1665151Z @given( 2025-05-07T20:33:44.1665285Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1665458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1665630Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1665770Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1665904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1665991Z ) 2025-05-07T20:33:44.1666276Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1666384Z def test_silu_mul_quant( 2025-05-07T20:33:44.1666473Z self, 2025-05-07T20:33:44.1666565Z T: int, 2025-05-07T20:33:44.1666654Z D: int, 2025-05-07T20:33:44.1666767Z scale_ub: Optional[float], 2025-05-07T20:33:44.1666876Z contiguous: bool, 2025-05-07T20:33:44.1666980Z compiled: bool, 2025-05-07T20:33:44.1667075Z ) -> None: 2025-05-07T20:33:44.1667186Z torch.manual_seed(2025) 2025-05-07T20:33:44.1667271Z 2025-05-07T20:33:44.1667468Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1667556Z 2025-05-07T20:33:44.1667665Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1667814Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1669788Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.1669797Z 2025-05-07T20:33:44.1669938Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:44.1669943Z 2025-05-07T20:33:44.1670062Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1670319Z self=, 2025-05-07T20:33:44.1670414Z T=128, 2025-05-07T20:33:44.1670505Z D=5120, 2025-05-07T20:33:44.1670605Z scale_ub=1200.0, 2025-05-07T20:33:44.1670718Z contiguous=True, 2025-05-07T20:33:44.1670825Z compiled=True, 2025-05-07T20:33:44.1670933Z ) 2025-05-07T20:33:44.1671181Z self = 2025-05-07T20:33:44.1671370Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:44.1671375Z 2025-05-07T20:33:44.1671467Z @given( 2025-05-07T20:33:44.1671601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1671768Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1671945Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1672080Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1672217Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1672306Z ) 2025-05-07T20:33:44.1672588Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1672699Z def test_silu_mul_quant( 2025-05-07T20:33:44.1672788Z self, 2025-05-07T20:33:44.1672878Z T: int, 2025-05-07T20:33:44.1672968Z D: int, 2025-05-07T20:33:44.1673083Z scale_ub: Optional[float], 2025-05-07T20:33:44.1673186Z contiguous: bool, 2025-05-07T20:33:44.1673287Z compiled: bool, 2025-05-07T20:33:44.1673377Z ) -> None: 2025-05-07T20:33:44.1673487Z torch.manual_seed(2025) 2025-05-07T20:33:44.1673637Z 2025-05-07T20:33:44.1673834Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1673923Z 2025-05-07T20:33:44.1674038Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1674183Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1676215Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
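One more observation, useful for isolating the Triton failure from the activation math: given the op's name and the way the test splits x into x0 and x1 halves, silu_mul_quant plausibly computes SiLU(x0) * x1 and then quantizes the product to FP8. An unquantized eager reference under that assumption (the kernel's actual semantics are not shown in this log):

    import torch
    import torch.nn.functional as F

    def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in the input dtype; the FP8 quantization step (and
        # the scale tensor the real op appears to return as y_scale) is omitted.
        return F.silu(x0) * x1

Comparing this against the kernel's dequantized output would separate numerical questions from the fp8e4nv compilation issue seen on this device.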
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.1676286Z 2025-05-07T20:33:44.1676424Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:44.1676429Z 2025-05-07T20:33:44.1676556Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1676813Z self=, 2025-05-07T20:33:44.1676903Z T=128, 2025-05-07T20:33:44.1676995Z D=7168, 2025-05-07T20:33:44.1677092Z scale_ub=None, 2025-05-07T20:33:44.1677191Z contiguous=True, 2025-05-07T20:33:44.1677296Z compiled=True, 2025-05-07T20:33:44.1677382Z ) 2025-05-07T20:33:44.1677630Z self = 2025-05-07T20:33:44.1677822Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.1677827Z 2025-05-07T20:33:44.1677916Z @given( 2025-05-07T20:33:44.1678054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1678168Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1678299Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1678435Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1678564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1678653Z ) 2025-05-07T20:33:44.1678939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1679046Z def test_silu_mul_quant( 2025-05-07T20:33:44.1679135Z self, 2025-05-07T20:33:44.1679228Z T: int, 2025-05-07T20:33:44.1679320Z D: int, 2025-05-07T20:33:44.1679439Z scale_ub: Optional[float], 2025-05-07T20:33:44.1679546Z contiguous: bool, 2025-05-07T20:33:44.1679644Z compiled: bool, 2025-05-07T20:33:44.1679737Z ) -> None: 2025-05-07T20:33:44.1679847Z torch.manual_seed(2025) 2025-05-07T20:33:44.1679933Z 2025-05-07T20:33:44.1680128Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1682234Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.1682285Z 2025-05-07T20:33:44.1682428Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:44.1682584Z =============================== warnings summary =============================== 2025-05-07T20:33:44.1682935Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:44.1683285Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:44.1683626Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:44.1684626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:44.1684893Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:44.1684944Z 2025-05-07T20:33:44.1685233Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:44.1685428Z ================= 1 failed, 1 deselected, 3 warnings in 18.32s ================= 2025-05-07T20:33:45.7790626Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:45.8414473Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:45.8414974Z 2025-05-07T20:33:45.8415327Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:45.8416543Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:45.8417382Z 2025-05-07T20:33:45.8417391Z 2025-05-07T20:33:45.8417399Z 2025-05-07T20:33:45.8435879Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:45.8516140Z Post job cleanup. 2025-05-07T20:33:45.9520806Z [command]/usr/bin/git version 2025-05-07T20:33:45.9566818Z git version 2.47.1 2025-05-07T20:33:45.9605867Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/b68dad46-55d8-40b5-a0ad-1b7c3566ed55/.gitconfig' 2025-05-07T20:33:45.9617301Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/b68dad46-55d8-40b5-a0ad-1b7c3566ed55' before making global git config changes 2025-05-07T20:33:45.9618203Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:45.9622944Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:45.9685902Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:45.9721244Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:46.0059874Z Entering 'external/asmjit' 2025-05-07T20:33:46.0126393Z Entering 'external/composable_kernel' 2025-05-07T20:33:46.0199912Z Entering 'external/cpuinfo' 2025-05-07T20:33:46.0267710Z Entering 'external/cutlass' 2025-05-07T20:33:46.0342560Z Entering 'external/googletest' 2025-05-07T20:33:46.0409502Z Entering 'external/hipify_torch' 2025-05-07T20:33:46.0476022Z Entering 'external/json' 2025-05-07T20:33:46.0563314Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:46.0588695Z http.https://github.com/.extraheader 2025-05-07T20:33:46.0601173Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:46.0632559Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:46.0962955Z Entering 'external/asmjit' 2025-05-07T20:33:46.1006108Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1050104Z Entering 'external/composable_kernel' 2025-05-07T20:33:46.1093104Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1142595Z Entering 'external/cpuinfo' 2025-05-07T20:33:46.1184714Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1228112Z Entering 'external/cutlass' 2025-05-07T20:33:46.1270361Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1321553Z 
Entering 'external/googletest' 2025-05-07T20:33:46.1364211Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1407038Z Entering 'external/hipify_torch' 2025-05-07T20:33:46.1450101Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1492560Z Entering 'external/json' 2025-05-07T20:33:46.1535621Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1686274Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:46.1721409Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:46.1732523Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:46.1732906Z ##[endgroup] 2025-05-07T20:33:46.1833679Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:57.3815575Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:34:14.4723516Z Cleaning up orphan processes
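For a local reproduction of this failure, the pytest invocation logged above can be rerun with the allocator setting the OOM messages recommend. The command and test path below come from this log; the wrapper itself, and whether the setting changes the outcome, are assumptions:

    import os
    import subprocess

    # Allocator hint taken from the OOM error text in this log.
    env = dict(os.environ, PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True")
    subprocess.run(
        ["python", "-m", "pytest", "-v", "-rsx", "-s",
         "-W", "ignore::pytest.PytestCollectionWarning",
         "--lf", "--last-failed-no-failures", "none",
         "./moe/activation_test.py"],
        env=env,
        check=False,  # the logged run exits non-zero; inspect output rather than raise
    )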