2025-05-07T20:22:35.2670720Z Current runner version: '2.323.0'
2025-05-07T20:22:35.2678202Z Runner name: 'i-011bf0f995071f8f9'
2025-05-07T20:22:35.2679116Z Machine name: 'ip-10-0-45-1'
2025-05-07T20:22:35.2681860Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:35.2684193Z Contents: read
2025-05-07T20:22:35.2684703Z Metadata: read
2025-05-07T20:22:35.2685203Z Packages: read
2025-05-07T20:22:35.2685693Z ##[endgroup]
2025-05-07T20:22:35.2687614Z Secret source: None
2025-05-07T20:22:35.2688307Z Prepare workflow directory
2025-05-07T20:22:35.3214122Z Prepare all required actions
2025-05-07T20:22:35.3252446Z Getting action download info
2025-05-07T20:22:35.5395909Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.7604293Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:36.0455133Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.5700625Z Getting action download info
2025-05-07T20:22:37.6878658Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:37.9027493Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.9, 12.6.3, 12.6.3, clang)
2025-05-07T20:22:37.9676576Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:37.9818514Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:37.9831848Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:37.9833493Z ##[endgroup]
2025-05-07T20:22:39.2137976Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.2138678Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.2139123Z AMI Name: unknown
2025-05-07T20:22:39.2178824Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.5873893Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.5874216Z with:
2025-05-07T20:22:44.5874437Z submodules: true
2025-05-07T20:22:44.5874676Z repository: pytorch/FBGEMM
2025-05-07T20:22:44.5875074Z token: ***
2025-05-07T20:22:44.5875281Z ssh-strict: true
2025-05-07T20:22:44.5875500Z ssh-user: git
2025-05-07T20:22:44.5875717Z persist-credentials: true
2025-05-07T20:22:44.5875976Z clean: true
2025-05-07T20:22:44.5876211Z sparse-checkout-cone-mode: true
2025-05-07T20:22:44.5876487Z fetch-depth: 1
2025-05-07T20:22:44.5876707Z fetch-tags: false
2025-05-07T20:22:44.5876923Z show-progress: true
2025-05-07T20:22:44.5877149Z lfs: false
2025-05-07T20:22:44.5877356Z set-safe-directory: true
2025-05-07T20:22:44.5877614Z env:
2025-05-07T20:22:44.5877829Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.5878139Z BUILD_ENV: build_binary
2025-05-07T20:22:44.5878403Z BUILD_TARGET: genai
2025-05-07T20:22:44.5878639Z BUILD_VARIANT: cuda
2025-05-07T20:22:44.5878902Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:44.5879156Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.5879401Z ##[endgroup]
2025-05-07T20:22:44.7031960Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.7033174Z ##[group]Getting Git version info
2025-05-07T20:22:44.7033686Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.7034316Z [command]/usr/bin/git version
2025-05-07T20:22:44.7034590Z git version 2.47.1
2025-05-07T20:22:44.7057070Z ##[endgroup]
2025-05-07T20:22:44.7070918Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/2c362f20-a439-4bf9-ac60-aa165daf02d7' before making global git config changes
2025-05-07T20:22:44.7071831Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.7075815Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.7113141Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.7116497Z ##[group]Initializing the repository
2025-05-07T20:22:44.7120666Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.7162469Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.7163302Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.7164001Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.7164564Z hint:
2025-05-07T20:22:44.7164964Z hint: git config --global init.defaultBranch <name>
2025-05-07T20:22:44.7165368Z hint:
2025-05-07T20:22:44.7165704Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.7166248Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.7166660Z hint:
2025-05-07T20:22:44.7166894Z hint: git branch -m <name>
2025-05-07T20:22:44.7167388Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.7174960Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.7209182Z ##[endgroup]
2025-05-07T20:22:44.7209759Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.7212974Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.7244893Z ##[endgroup]
2025-05-07T20:22:44.7245428Z ##[group]Setting up auth
2025-05-07T20:22:44.7251139Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.7283300Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.7650524Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.7683474Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.8023126Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.8072657Z ##[endgroup]
2025-05-07T20:22:44.8073254Z ##[group]Fetching the repository
2025-05-07T20:22:44.8080693Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3845344Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3846146Z * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3870448Z ##[endgroup]
2025-05-07T20:22:45.3870990Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3873096Z ##[endgroup]
2025-05-07T20:22:45.3877260Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.3914676Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:22:45.3956655Z ##[group]Checking out the ref
2025-05-07T20:22:45.3960052Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.5035812Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.5036233Z
2025-05-07T20:22:45.5036531Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.5037249Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.5037762Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.5038075Z
2025-05-07T20:22:45.5038291Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.5038767Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.5039035Z
2025-05-07T20:22:45.5039152Z git switch -c <new-branch-name>
2025-05-07T20:22:45.5039346Z
2025-05-07T20:22:45.5039480Z Or undo this operation with:
2025-05-07T20:22:45.5039656Z
2025-05-07T20:22:45.5039749Z git switch -
2025-05-07T20:22:45.5040399Z
2025-05-07T20:22:45.5040634Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.5040957Z
2025-05-07T20:22:45.5041341Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.5050812Z ##[endgroup]
2025-05-07T20:22:45.5056117Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.5056856Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.5103340Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.5135328Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.5169952Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.5197026Z ##[endgroup]
2025-05-07T20:22:45.5197552Z ##[group]Fetching submodules
2025-05-07T20:22:45.5199958Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.5545891Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.5877003Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.5879036Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.5882501Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.5885880Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.5889501Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.5893469Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.5896517Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.5927445Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.9480646Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.4368854Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:46.8554312Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:47.9380797Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.2509546Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.4961794Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.6374440Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.6374926Z * branch e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.6866045Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:50.4056862Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:50.4057350Z * branch 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:50.6853641Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:51.3609760Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:51.3610212Z * branch 6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:51.4599220Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:52.7314235Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:52.7315109Z * branch 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:53.4193211Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.1170188Z From https://github.com/google/googletest
2025-05-07T20:22:54.1170665Z * branch f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.1570786Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:55.0643418Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:55.0644396Z * branch 420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:55.0729991Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:55.7948994Z From https://github.com/nlohmann/json
2025-05-07T20:22:55.7949465Z * branch 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:55.9057474Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:55.9076082Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:55.9409143Z Entering 'external/asmjit'
2025-05-07T20:22:55.9440619Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.9473470Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.9505157Z Entering 'external/cutlass'
2025-05-07T20:22:55.9536277Z Entering 'external/googletest'
2025-05-07T20:22:55.9567912Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.9598984Z Entering 'external/json'
2025-05-07T20:22:55.9643933Z ##[endgroup]
2025-05-07T20:22:55.9644349Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:55.9650402Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:55.9982404Z Entering 'external/asmjit'
2025-05-07T20:22:56.0049012Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.0119482Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.0185088Z Entering 'external/cutlass'
2025-05-07T20:22:56.0258717Z Entering 'external/googletest'
2025-05-07T20:22:56.0327898Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.0396084Z Entering 'external/json'
2025-05-07T20:22:56.0479104Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:56.0808975Z Entering 'external/asmjit'
2025-05-07T20:22:56.0871826Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:56.0874347Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.0935311Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:56.0938426Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.1001364Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:56.1004194Z Entering 'external/cutlass'
2025-05-07T20:22:56.1064807Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:56.1067780Z Entering 'external/googletest'
2025-05-07T20:22:56.1132286Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:56.1134996Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.1195850Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:56.1199872Z Entering 'external/json'
2025-05-07T20:22:56.1260596Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:56.1346228Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:56.1673525Z Entering 'external/asmjit'
2025-05-07T20:22:56.1706107Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.1738355Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.1774510Z Entering 'external/cutlass'
2025-05-07T20:22:56.1805735Z Entering 'external/googletest'
2025-05-07T20:22:56.1836669Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.1869501Z Entering 'external/json'
2025-05-07T20:22:56.1915740Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.2243926Z Entering 'external/asmjit'
2025-05-07T20:22:56.2311562Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.2311885Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.2338677Z Entering 'external/cutlass'
2025-05-07T20:22:56.2371254Z Entering 'external/googletest'
2025-05-07T20:22:56.2401774Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.2434045Z Entering 'external/json'
2025-05-07T20:22:56.2477869Z ##[endgroup]
2025-05-07T20:22:56.2520152Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.2548276Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
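For local debugging, the checkout above reduces to a handful of git commands. A minimal sketch, using the merge ref and commit recorded in this log (substitute your own PR number and SHA for other runs):

# Reproduce the shallow PR-merge checkout performed by actions/checkout@v4 above.
git init FBGEMM && cd FBGEMM
git remote add origin https://github.com/pytorch/FBGEMM
git config --local gc.auto 0   # the action disables auto-GC, as logged above
git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 \
    origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
git checkout --progress --force refs/remotes/pull/4066/merge
# Shallow-initialize all submodules, exactly as the action does.
git -c protocol.version=2 submodule update --init --force --depth=1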
2025-05-07T20:22:56.2731375Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.2731688Z with:
2025-05-07T20:22:56.2731934Z name: fbgemm_genai_x86_clang_py3.9_cu12.6.3.whl
2025-05-07T20:22:56.2732263Z merge-multiple: false
2025-05-07T20:22:56.2732512Z repository: pytorch/FBGEMM
2025-05-07T20:22:56.2732768Z run-id: 14891846252
2025-05-07T20:22:56.2732976Z env:
2025-05-07T20:22:56.2733200Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.2733491Z BUILD_ENV: build_binary
2025-05-07T20:22:56.2733731Z BUILD_TARGET: genai
2025-05-07T20:22:56.2733950Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.2734185Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.2734431Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.2734665Z ##[endgroup]
2025-05-07T20:22:56.5068765Z Downloading single artifact
2025-05-07T20:22:56.6074921Z Preparing to download the following artifacts:
2025-05-07T20:22:56.6075890Z - fbgemm_genai_x86_clang_py3.9_cu12.6.3.whl (ID: 3081363869, Size: 12542866, Expected Digest: sha256:497773d2b688d8ce372143b11ddd93f307146ed7f45f4420437a8c620b3a9aa4)
2025-05-07T20:22:56.6585072Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-762f8d52-3fb2-51f9-ac57-047851dc6d3c/artifacts/0a0e162a22a3d874d00e499951e68dc83c18e66afa6b49ef075dcdcd39d2276e.zip
2025-05-07T20:22:56.6586479Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:56.7450438Z (node:57021) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:56.7451393Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:56.9609003Z SHA256 digest of downloaded artifact is 497773d2b688d8ce372143b11ddd93f307146ed7f45f4420437a8c620b3a9aa4
2025-05-07T20:22:56.9609616Z Artifact download completed successfully.
2025-05-07T20:22:56.9609958Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:56.9614765Z Download artifact has finished successfully
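The same artifact can be fetched outside of Actions with the GitHub CLI. A hedged sketch (run ID and artifact name are taken from the log above; `gh` must be authenticated with a token that can read the repository's artifacts):

# Download the wheel artifact from run 14891846252; gh extracts it into the
# target directory. The action above already verified the archive digest
# sha256:497773d2b688d8ce372143b11ddd93f307146ed7f45f4420437a8c620b3a9aa4.
gh run download 14891846252 --repo pytorch/FBGEMM \
    --name fbgemm_genai_x86_clang_py3.9_cu12.6.3.whl --dir .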
2025-05-07T20:22:56.9873536Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:56.9873935Z with:
2025-05-07T20:22:56.9874155Z driver-version: 570.133.07
2025-05-07T20:22:56.9874412Z env:
2025-05-07T20:22:56.9874637Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.9874939Z BUILD_ENV: build_binary
2025-05-07T20:22:56.9875191Z BUILD_TARGET: genai
2025-05-07T20:22:56.9875483Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.9875721Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.9875984Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.9876229Z ##[endgroup]
2025-05-07T20:22:56.9965838Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:56.9966235Z with:
2025-05-07T20:22:56.9966643Z timeout_minutes: 10
2025-05-07T20:22:56.9966873Z max_attempts: 3
2025-05-07T20:22:56.9990170Z command:
# Is it disgusting to have a full shell script here in this github action? Sure
# But is it the best way to make it so that this action relies on nothing else? Absolutely
set -eou pipefail

DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

install_nvidia_docker2_amzn2() {
  (
    set -x
    # Needed for yum-config-manager
    sudo yum install -y yum-utils
    if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
      YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
    else
      # Amazon Linux 2
      YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
    fi
    sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
    sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
    sudo systemctl restart docker
  )
}

install_nvidia_docker2_ubuntu20() {
  (
    set -x
    # Install nvidia-driver package if not installed
    status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
    if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
      sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
      sudo systemctl restart docker
    fi
  )
}

pre_install_nvidia_driver_amzn2() {
  (
    # Purge any nvidia driver installed from RHEL repo
    sudo yum remove -y nvidia-driver-latest-dkms
  )
}

install_nvidia_driver_common() {
  (
    # Try to gather more information about the runner and its existing NVIDIA driver if any
    echo "Before installing NVIDIA driver"
    lspci
    lsmod
    modinfo nvidia || true

    HAS_NVIDIA_DRIVER=0
    # Check if NVIDIA driver has already been installed
    if [ -x "$(command -v nvidia-smi)" ]; then
      set +e
      # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
      # so that the same driver version is not printed over multiple lines
      INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
      NVIDIA_SMI_STATUS=$?
      if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
        echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
      elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
        echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
        # Turn off persistent mode so that the installation script can unload the kernel module
        sudo killall nvidia-persistenced || true
      else
        HAS_NVIDIA_DRIVER=1
        echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
      fi
      set -e
    fi

    if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
      # CAUTION: this may need to be updated in future
      if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
        sudo yum groupinstall -y "Development Tools"
        # ensure our kernel install is the same as our underlying kernel,
        # groupinstall "Development Tools" has a habit of mismatching kernel headers
        sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
        sudo modprobe backlight
      fi
      sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

      set +e
      sudo /bin/bash /tmp/nvidia_driver -s --no-drm
      NVIDIA_INSTALLATION_STATUS=$?

      RESET_GPU=0
      if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
        sudo cat /var/log/nvidia-installer.log
        # Failed to install NVIDIA driver, try to reset the GPU
        RESET_GPU=1
      elif [ -x "$(command -v nvidia-smi)" ]; then
        # Check again if nvidia-smi works even if the driver installation completes successfully
        INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
        NVIDIA_SMI_STATUS=$?
        if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
          RESET_GPU=1
        fi
      fi

      if [ "$RESET_GPU" -eq 1 ]; then
        NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
        # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
        # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
        for PCI_ID in $NVIDIA_DEVICES; do
          DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
          echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
          # This requires sudo permission of course
          echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
          sleep 1
        done
      fi

      sudo rm -fv /tmp/nvidia_driver
      set -e
    fi
  )
}

post_install_nvidia_driver_common() {
  (
    sudo modprobe nvidia || true
    echo "After installing NVIDIA driver"
    lspci
    lsmod
    modinfo nvidia || true
    (
      set +e
      nvidia-smi
      # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
      # the case where the driver has already crashed as it still can get the driver version
      # and some basic information like the bus ID. However, the rest of the information
      # would be missing (ERR!), for example:
      #
      # +-----------------------------------------------------------------------------+
      # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 |
      # |-------------------------------+----------------------+----------------------+
      # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
      # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
      # | | | MIG M. |
      # |===============================+======================+======================|
      # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! |
      # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default |
      # | | | ERR! |
      # +-------------------------------+----------------------+----------------------+
      #
      # +-----------------------------------------------------------------------------+
      # | Processes: |
      # | GPU GI CI PID Type Process name GPU Memory |
      # | ID ID Usage |
      # |=============================================================================|
      # +-----------------------------------------------------------------------------+
      #
      # This should be reported as a failure instead as it will guarantee to fail when
      # Docker tries to run with --gpus all
      #
      # So, the correct check here is to query one of the missing pieces of info like
      # GPU name, so that the command can fail accordingly
      nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
      NVIDIA_SMI_STATUS=$?
      # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
      if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
        echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
      else
        echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
        exit ${NVIDIA_SMI_STATUS}
      fi
      set -e
    )
  )
}

install_nvidia_driver_amzn2() {
  (
    set -x
    pre_install_nvidia_driver_amzn2
    install_nvidia_driver_common
    post_install_nvidia_driver_common
  )
}

install_nvidia_driver_ubuntu20() {
  (
    set -x
    install_nvidia_driver_common
    post_install_nvidia_driver_common
  )
}

echo "== Installing nvidia driver ${DRIVER_FN} =="
case "${DISTRIBUTION}" in
  amzn*)
    install_nvidia_driver_amzn2
    ;;
  ubuntu20.04)
    install_nvidia_driver_ubuntu20
    ;;
  *)
    echo "ERROR: Unknown distribution ${DISTRIBUTION}"
    exit 1
    ;;
esac

# Install container toolkit based on distribution
echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
case "${DISTRIBUTION}" in
  amzn*)
    install_nvidia_docker2_amzn2
    ;;
  ubuntu20.04)
    install_nvidia_docker2_ubuntu20
    ;;
  *)
    echo "ERROR: Unknown distribution ${DISTRIBUTION}"
    exit 1
    ;;
esac

echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

# Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
# more than one GPU. This just needs to be run once. The command fails
# on subsequent runs and complains that the mode is already on, but that's
# ok
sudo nvidia-persistenced || true

# This should show persistence mode ON
nvidia-smi
2025-05-07T20:22:57.0013026Z retry_wait_seconds: 10
2025-05-07T20:22:57.0013295Z polling_interval_seconds: 1
2025-05-07T20:22:57.0013561Z warning_on_retry: true
2025-05-07T20:22:57.0013813Z continue_on_error: false
2025-05-07T20:22:57.0014054Z env:
2025-05-07T20:22:57.0014275Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.0014590Z BUILD_ENV: build_binary
2025-05-07T20:22:57.0014840Z BUILD_TARGET: genai
2025-05-07T20:22:57.0015066Z BUILD_VARIANT: cuda
2025-05-07T20:22:57.0015316Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:57.0015583Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.0015827Z DRIVER_VERSION: 570.133.07
2025-05-07T20:22:57.0016081Z ##[endgroup]
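The core of the script above is its GPU health check: a bare nvidia-smi can exit 0 even when the driver has crashed, so the script queries a field that turns to ERR! in that state and treats only exit codes 0 and 14 as healthy. A minimal standalone sketch of that check:

# Query the GPU name; this fails when the driver is wedged even though plain
# `nvidia-smi` would still exit 0. Status 14 is allowed, per
# https://github.com/NVIDIA/gpu-operator/issues/285.
set +e
nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
STATUS=$?
set -e
if [ "$STATUS" -ne 0 ] && [ "$STATUS" -ne 14 ]; then
  echo "ERROR: nvidia-smi exited with unresolved status ${STATUS}" >&2
  exit "$STATUS"
fi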
2025-05-07T20:22:57.0813406Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:57.0814509Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:57.0817912Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.7246631Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.7247021Z No packages marked for removal.
2025-05-07T20:22:57.7308801Z Dependencies resolved.
2025-05-07T20:22:57.7318640Z Nothing to do.
2025-05-07T20:22:57.7319113Z Complete!
2025-05-07T20:22:57.7646226Z + install_nvidia_driver_common
2025-05-07T20:22:57.7650259Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.7650581Z + lspci
2025-05-07T20:22:57.7652236Z Before installing NVIDIA driver
2025-05-07T20:22:57.7834510Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.7835253Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.7835820Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.7836356Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.7836839Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.7837377Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.7837871Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.7838340Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.7838743Z + lsmod
2025-05-07T20:22:57.7879693Z Module Size Used by
2025-05-07T20:22:57.7880017Z xt_conntrack 16384 1
2025-05-07T20:22:57.7880283Z nft_chain_nat 16384 3
2025-05-07T20:22:57.7880554Z xt_MASQUERADE 20480 1
2025-05-07T20:22:57.7880866Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.7881196Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:57.7881600Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.7882041Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:57.7882362Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:57.7882656Z xfrm_user 57344 1
2025-05-07T20:22:57.7882928Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:57.7883223Z xt_addrtype 16384 2
2025-05-07T20:22:57.7883484Z nft_compat 20480 4
2025-05-07T20:22:57.7883796Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.7884216Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.7884597Z br_netfilter 36864 0
2025-05-07T20:22:57.7884887Z bridge 323584 1 br_netfilter
2025-05-07T20:22:57.7885201Z stp 16384 1 bridge
2025-05-07T20:22:57.7885495Z llc 16384 2 bridge,stp
2025-05-07T20:22:57.7885794Z overlay 167936 0
2025-05-07T20:22:57.7886056Z tls 135168 0
2025-05-07T20:22:57.7886316Z nls_ascii 16384 1
2025-05-07T20:22:57.7886571Z nls_cp437 20480 1
2025-05-07T20:22:57.7886828Z vfat 24576 1
2025-05-07T20:22:57.7887092Z fat 86016 1 vfat
2025-05-07T20:22:57.7887359Z sunrpc 696320 1
2025-05-07T20:22:57.7887613Z ena 180224 0
2025-05-07T20:22:57.7887870Z i8042 45056 0
2025-05-07T20:22:57.7888126Z serio 28672 3 i8042
2025-05-07T20:22:57.7888408Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:57.7888814Z button 24576 0
2025-05-07T20:22:57.7889077Z sch_fq_codel 20480 17
2025-05-07T20:22:57.7889335Z dm_mod 188416 0
2025-05-07T20:22:57.7889594Z fuse 163840 1
2025-05-07T20:22:57.7889855Z loop 36864 0
2025-05-07T20:22:57.7890109Z configfs 57344 1
2025-05-07T20:22:57.7890371Z dax 45056 1 dm_mod
2025-05-07T20:22:57.7890657Z dmi_sysfs 20480 0
2025-05-07T20:22:57.7890913Z crc32_pclmul 16384 0
2025-05-07T20:22:57.7891180Z crc32c_intel 24576 0
2025-05-07T20:22:57.7891443Z efivarfs 24576 1
2025-05-07T20:22:57.7891693Z + modinfo nvidia
2025-05-07T20:22:57.7898491Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.7898981Z import_ns: DMA_BUF
2025-05-07T20:22:57.7899241Z alias: char-major-195-*
2025-05-07T20:22:57.7899509Z version: 570.133.07
2025-05-07T20:22:57.7899771Z supported: external
2025-05-07T20:22:57.7900042Z license: Dual MIT/GPL
2025-05-07T20:22:57.7900335Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.7900682Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.7901449Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:57.7901793Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.7902224Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.7902594Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.7902914Z depends: i2c-core,drm
2025-05-07T20:22:57.7903185Z retpoline: Y
2025-05-07T20:22:57.7903404Z name: nvidia
2025-05-07T20:22:57.7903767Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.7904246Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.7904692Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.7905249Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.7905565Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:57.7905867Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.7906202Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:57.7906509Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:57.7906834Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:57.7907201Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.7907660Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.7908003Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.7908304Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:57.7908622Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.7908995Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.7909401Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.7909796Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.7910231Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.7910652Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.7911084Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.7911509Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.7911861Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.7912239Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.7912628Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.7912982Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.7913310Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.7913656Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.7913992Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.7914312Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:57.7914663Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.7915043Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.7915385Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:57.7915726Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.7916082Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.7916434Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:57.7916782Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.7917125Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:57.7917422Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.7917757Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.7918093Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.7918419Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.7918764Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.7919126Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.7919483Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:57.7919825Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.7920175Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.7920523Z parm: rm_firmware_active:charp
2025-05-07T20:22:57.7920916Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.7921162Z ++ command -v nvidia-smi
2025-05-07T20:22:57.7921429Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.7921693Z + set +e
2025-05-07T20:22:57.7921998Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:59.5923985Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:59.5924383Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:59.5924641Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:59.5924878Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:59.5925154Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:59.5925611Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:59.5926094Z + set -e
2025-05-07T20:22:59.5926661Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:59.5927064Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:59.5927538Z + post_install_nvidia_driver_common
2025-05-07T20:22:59.5930497Z + sudo modprobe nvidia
2025-05-07T20:22:59.7343945Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:59.7344301Z + lspci
2025-05-07T20:22:59.7344538Z After installing NVIDIA driver
2025-05-07T20:22:59.7458655Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:59.7459175Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:59.7459735Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:59.7460273Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:59.7460754Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:59.7461333Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:59.7461851Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:59.7462338Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:59.7462742Z + lsmod
2025-05-07T20:22:59.7490954Z Module Size Used by
2025-05-07T20:22:59.7491260Z nvidia_uvm 1884160 0
2025-05-07T20:22:59.7491552Z nvidia 11583488 1 nvidia_uvm
2025-05-07T20:22:59.7491861Z drm 602112 1 nvidia
2025-05-07T20:22:59.7492174Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:22:59.7492502Z backlight 24576 1 drm
2025-05-07T20:22:59.7492796Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:22:59.7493099Z xt_conntrack 16384 1
2025-05-07T20:22:59.7493362Z nft_chain_nat 16384 3
2025-05-07T20:22:59.7493634Z xt_MASQUERADE 20480 1
2025-05-07T20:22:59.7493948Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:59.7494296Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:59.7494695Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:59.7495134Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:59.7495459Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:59.7495756Z xfrm_user 57344 1
2025-05-07T20:22:59.7496035Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:59.7496334Z xt_addrtype 16384 2
2025-05-07T20:22:59.7496593Z nft_compat 20480 4
2025-05-07T20:22:59.7496914Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:59.7497335Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:59.7497716Z br_netfilter 36864 0
2025-05-07T20:22:59.7497994Z bridge 323584 1 br_netfilter
2025-05-07T20:22:59.7498298Z stp 16384 1 bridge
2025-05-07T20:22:59.7498590Z llc 16384 2 bridge,stp
2025-05-07T20:22:59.7498878Z overlay 167936 0
2025-05-07T20:22:59.7499138Z tls 135168 0
2025-05-07T20:22:59.7499395Z nls_ascii 16384 1
2025-05-07T20:22:59.7499861Z nls_cp437 20480 1
2025-05-07T20:22:59.7500128Z vfat 24576 1
2025-05-07T20:22:59.7500389Z fat 86016 1 vfat
2025-05-07T20:22:59.7500654Z sunrpc 696320 1
2025-05-07T20:22:59.7500913Z ena 180224 0
2025-05-07T20:22:59.7501240Z i8042 45056 0
2025-05-07T20:22:59.7501498Z serio 28672 3 i8042
2025-05-07T20:22:59.7502022Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:59.7502302Z button 24576 0
2025-05-07T20:22:59.7502560Z sch_fq_codel 20480 17
2025-05-07T20:22:59.7502816Z dm_mod 188416 0
2025-05-07T20:22:59.7503067Z fuse 163840 1
2025-05-07T20:22:59.7503321Z loop 36864 0
2025-05-07T20:22:59.7503728Z configfs 57344 1
2025-05-07T20:22:59.7503992Z dax 45056 1 dm_mod
2025-05-07T20:22:59.7504273Z dmi_sysfs 20480 0
2025-05-07T20:22:59.7504530Z crc32_pclmul 16384 0
2025-05-07T20:22:59.7504802Z crc32c_intel 24576 0
2025-05-07T20:22:59.7505062Z efivarfs 24576 1
2025-05-07T20:22:59.7505316Z + modinfo nvidia
2025-05-07T20:22:59.7507960Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:59.7508428Z import_ns: DMA_BUF
2025-05-07T20:22:59.7508687Z alias: char-major-195-*
2025-05-07T20:22:59.7508957Z version: 570.133.07
2025-05-07T20:22:59.7509213Z supported: external
2025-05-07T20:22:59.7509474Z license: Dual MIT/GPL
2025-05-07T20:22:59.7509766Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:59.7510119Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:59.7510448Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:59.7510779Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:59.7511119Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:59.7511463Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:59.7511787Z depends: i2c-core,drm
2025-05-07T20:22:59.7512049Z retpoline: Y
2025-05-07T20:22:59.7512279Z name: nvidia
2025-05-07T20:22:59.7512646Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:59.7513122Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:59.7513575Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:59.7513998Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:59.7514314Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:59.7514616Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:59.7514938Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:59.7515243Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:59.7515552Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:59.7515922Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:59.7516319Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:59.7516657Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:59.7516963Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:59.7517274Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:59.7517638Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:59.7518042Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:59.7518428Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:59.7518851Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.7519261Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:59.7519692Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.7520119Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:59.7520459Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:59.7520834Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:59.7521318Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:59.7521667Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:59.7521996Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:59.7522338Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:59.7522671Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:59.7522990Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:59.7523348Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:59.7523720Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:59.7524047Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:59.7524398Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:59.7524754Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:59.7525191Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:59.7525541Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:59.7525882Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:59.7526187Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:59.7526516Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:59.7526849Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:59.7527174Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:59.7527506Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:59.7527871Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:59.7528229Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:59.7528554Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:59.7528912Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:59.7529267Z parm: rm_firmware_active:charp
2025-05-07T20:22:59.7529552Z + set +e
2025-05-07T20:22:59.7529764Z + nvidia-smi
2025-05-07T20:23:01.1545660Z Wed May 7 20:23:01 2025
2025-05-07T20:23:01.1546259Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.1546801Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:01.1547284Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.1547780Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:01.1548317Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:01.1548763Z | | | MIG M. |
2025-05-07T20:23:01.1549102Z |=========================================+========================+======================|
2025-05-07T20:23:01.1610227Z | 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:01.1610697Z | 0% 32C P0 64W / 300W | 0MiB / 23028MiB | 4% Default |
2025-05-07T20:23:01.1611094Z | | | N/A |
2025-05-07T20:23:01.1611487Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.1612109Z
2025-05-07T20:23:01.1612558Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.1613010Z | Processes: |
2025-05-07T20:23:01.1613468Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:01.1613891Z | ID ID Usage |
2025-05-07T20:23:01.1614250Z |=========================================================================================|
2025-05-07T20:23:01.1615146Z | No running processes found |
2025-05-07T20:23:01.1615877Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.5844062Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:02.9922067Z NVIDIA A10G
2025-05-07T20:23:03.2587470Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:03.2587820Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:03.2588067Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:03.2588361Z + set -e
2025-05-07T20:23:03.2588578Z INFO: Ignoring allowed status 0
2025-05-07T20:23:03.2596749Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:03.2599638Z + sudo yum install -y yum-utils
2025-05-07T20:23:03.6775342Z Last metadata expiration check: 0:04:50 ago on Wed May 7 20:18:13 2025.
2025-05-07T20:23:03.7032420Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:03.7431976Z Dependencies resolved.
2025-05-07T20:23:03.7613704Z Nothing to do.
2025-05-07T20:23:03.7614257Z Complete!
2025-05-07T20:23:03.8012604Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:03.8013247Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.8014095Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.1249476Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.1826875Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:04.7123061Z nvidia-container-toolkit 15 kB/s | 833 B 00:00
2025-05-07T20:23:04.7375003Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:04.7772911Z Dependencies resolved.
2025-05-07T20:23:04.7954232Z ================================================================================
2025-05-07T20:23:04.7955284Z Package Arch Version Repository Size
2025-05-07T20:23:04.7956034Z ================================================================================
2025-05-07T20:23:04.7956439Z Downgrading:
2025-05-07T20:23:04.7956823Z nvidia-container-toolkit x86_64 1.16.2-1 nvidia-container-toolkit 1.2 M
2025-05-07T20:23:04.7957433Z nvidia-container-toolkit-base x86_64 1.16.2-1 nvidia-container-toolkit 5.6 M
2025-05-07T20:23:04.7957851Z
2025-05-07T20:23:04.7957952Z Transaction Summary
2025-05-07T20:23:04.7958214Z ================================================================================
2025-05-07T20:23:04.7958536Z Downgrade 2 Packages
2025-05-07T20:23:04.7958687Z
2025-05-07T20:23:04.7958802Z Total download size: 6.8 M
2025-05-07T20:23:04.7959058Z Downloading Packages:
2025-05-07T20:23:04.8447529Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64 26 MB/s | 1.2 MB 00:00
2025-05-07T20:23:04.8881809Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x 62 MB/s | 5.6 MB 00:00
2025-05-07T20:23:04.8890460Z --------------------------------------------------------------------------------
2025-05-07T20:23:04.8893565Z Total 73 MB/s | 6.8 MB 00:00
2025-05-07T20:23:04.8895974Z Running transaction check
2025-05-07T20:23:04.9001005Z Transaction check succeeded.
2025-05-07T20:23:04.9001360Z Running transaction test
2025-05-07T20:23:04.9296166Z Transaction test succeeded.
2025-05-07T20:23:04.9298495Z Running transaction
2025-05-07T20:23:05.4805799Z Preparing : 1/1
2025-05-07T20:23:05.5855574Z Downgrading : nvidia-container-toolkit-base-1.16.2-1.x86_64 1/4
2025-05-07T20:23:05.5878432Z Downgrading : nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:05.6098261Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:05.6098860Z Cleanup : nvidia-container-toolkit-1.17.6-1.x86_64 3/4
2025-05-07T20:23:05.6201839Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 3/4
2025-05-07T20:23:05.6223011Z Cleanup : nvidia-container-toolkit-base-1.17.6-1.x86_64 4/4
2025-05-07T20:23:06.9932958Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 4/4
2025-05-07T20:23:06.9933583Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 1/4
2025-05-07T20:23:06.9934146Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:06.9934680Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 3/4
2025-05-07T20:23:07.1285407Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 4/4
================================================================================
2025-05-07T20:23:07.1286344Z WARNING:
2025-05-07T20:23:07.1286600Z A newer release of "Amazon Linux" is available.
2025-05-07T20:23:07.1286832Z
2025-05-07T20:23:07.1286929Z Available Versions:
2025-05-07T20:23:07.1287082Z
2025-05-07T20:23:07.1287187Z Version 2023.7.20250331:
2025-05-07T20:23:07.1287505Z Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:07.1287760Z
2025-05-07T20:23:07.1287896Z dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:07.1288110Z
2025-05-07T20:23:07.1288199Z Release notes:
2025-05-07T20:23:07.1296835Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:07.1297272Z
2025-05-07T20:23:07.1297374Z Version 2023.7.20250414:
2025-05-07T20:23:07.1297699Z Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:07.1297956Z
2025-05-07T20:23:07.1298087Z dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:07.1298300Z
2025-05-07T20:23:07.1298388Z Release notes:
2025-05-07T20:23:07.1298806Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:07.1299184Z
2025-05-07T20:23:07.1299274Z Version 2023.7.20250428:
2025-05-07T20:23:07.1299595Z Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:07.1299849Z
2025-05-07T20:23:07.1299968Z dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:07.1300186Z
2025-05-07T20:23:07.1300276Z Release notes:
2025-05-07T20:23:07.1300725Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:07.1301226Z
2025-05-07T20:23:07.1301376Z ================================================================================
2025-05-07T20:23:07.1642947Z
2025-05-07T20:23:07.1643128Z
2025-05-07T20:23:07.1643420Z Downgraded:
2025-05-07T20:23:07.1643818Z nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:07.1644401Z nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:07.1644782Z
2025-05-07T20:23:07.1644870Z Complete!
2025-05-07T20:23:07.2085183Z + sudo systemctl restart docker
2025-05-07T20:23:11.2014843Z Wed May 7 20:23:11 2025
2025-05-07T20:23:11.2015304Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.2015830Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:11.2016323Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.2016824Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:11.2017367Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:11.2017811Z | | | MIG M. |
2025-05-07T20:23:11.2018162Z |=========================================+========================+======================|
2025-05-07T20:23:11.2098879Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:11.2101591Z | 0% 32C P0 63W / 300W | 0MiB / 23028MiB | 4% Default |
2025-05-07T20:23:11.2102098Z | | | N/A |
2025-05-07T20:23:11.2102513Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.2102974Z
2025-05-07T20:23:11.2103592Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.2104217Z | Processes: |
2025-05-07T20:23:11.2104752Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:11.2105418Z | ID ID Usage |
2025-05-07T20:23:11.2105773Z |=========================================================================================|
2025-05-07T20:23:11.2106203Z | No running processes found |
2025-05-07T20:23:11.2106677Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:12.0577219Z Command completed after 1 attempt(s).
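With the toolkit downgraded to 1.16.2 and docker restarted, the step exports GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all for the container runs that follow (visible in the env dump below). A hedged smoke test of that configuration (the CUDA image tag is an assumption, not something this job pulls):

# Confirm containers can see the A10G through the NVIDIA container runtime.
docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all \
    nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi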
2025-05-07T20:23:12.4074930Z + printenv 2025-05-07T20:23:12.4075049Z 2025-05-07T20:23:12.4097282Z SHELL=/bin/bash 2025-05-07T20:23:12.4097626Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:23:12.4098050Z BUILD_VARIANT=cuda 2025-05-07T20:23:12.4098616Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_b440ea1e-c694-438b-b960-cd28a028bf37 2025-05-07T20:23:12.4099199Z GITHUB_ACTION=__run 2025-05-07T20:23:12.4099486Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:12.4099831Z GITHUB_RUN_NUMBER=10601 2025-05-07T20:23:12.4100092Z RUNNER_NAME=i-011bf0f995071f8f9 2025-05-07T20:23:12.4100371Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-05-07T20:23:12.4100687Z PLATFORM_NAME_LC=linux-x86_64 2025-05-07T20:23:12.4100963Z MACHINE_NAME_LC=x86_64 2025-05-07T20:23:12.4101430Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh 2025-05-07T20:23:12.4101865Z GITHUB_TRIGGERING_ACTOR=q10 2025-05-07T20:23:12.4102157Z PRELUDE=.github/scripts/setup_env.bash 2025-05-07T20:23:12.4102460Z GITHUB_REF_TYPE=branch 2025-05-07T20:23:12.4103090Z *** 2025-05-07T20:23:12.4103298Z LOGNAME=ec2-user 2025-05-07T20:23:12.4103544Z GITHUB_REPOSITORY_ID=150154628 2025-05-07T20:23:12.4103814Z ENFORCE_CUDA_DEVICE=1 2025-05-07T20:23:12.4104067Z GITHUB_ACTIONS=true 2025-05-07T20:23:12.4104302Z SYSTEMD_EXEC_PID=55588 2025-05-07T20:23:12.4104585Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T20:23:12.4105141Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge 2025-05-07T20:23:12.4105656Z RUNNER_ENVIRONMENT=self-hosted 2025-05-07T20:23:12.4105939Z GITHUB_REF=refs/pull/4066/merge 2025-05-07T20:23:12.4106210Z RUNNER_OS=Linux 2025-05-07T20:23:12.4106440Z GITHUB_REF_PROTECTED=false 2025-05-07T20:23:12.4106699Z HOME=/home/ec2-user 2025-05-07T20:23:12.4106956Z GITHUB_API_URL=https://api.github.com 2025-05-07T20:23:12.4107252Z LANG=C.UTF-8 2025-05-07T20:23:12.4107556Z RUNNER_TRACKING_ID=github_ae3f7369-4363-4024-b8cc-9d7f5b212b73 2025-05-07T20:23:12.4107920Z RUNNER_ARCH=X64 2025-05-07T20:23:12.4108203Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp 2025-05-07T20:23:12.4108890Z BUILD_TARGET=genai 2025-05-07T20:23:12.4109415Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_b440ea1e-c694-438b-b960-cd28a028bf37 2025-05-07T20:23:12.4110278Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_b440ea1e-c694-438b-b960-cd28a028bf37 2025-05-07T20:23:12.4111007Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-05-07T20:23:12.4113359Z INVOCATION_ID=95b235614dca4c4e829ee33e73dd6c05 2025-05-07T20:23:12.4113700Z GITHUB_EVENT_NAME=pull_request 2025-05-07T20:23:12.4113969Z GITHUB_RUN_ID=14891846252 2025-05-07T20:23:12.4114556Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_b440ea1e-c694-438b-b960-cd28a028bf37 2025-05-07T20:23:12.4115169Z BUILD_ENV=build_binary 2025-05-07T20:23:12.4115397Z GITHUB_ACTOR=q10 2025-05-07T20:23:12.4115623Z GITHUB_RUN_ATTEMPT=1 2025-05-07T20:23:12.4115855Z KERN_NAME_LC=linux 2025-05-07T20:23:12.4116084Z BUILD_CUDA_VERSION=12.6.3 2025-05-07T20:23:12.4116392Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-05-07T20:23:12.4116737Z PLATFORM_NAME=Linux-x86_64 2025-05-07T20:23:12.4116981Z USER=ec2-user 2025-05-07T20:23:12.4117219Z GITHUB_SERVER_URL=https://github.com 
2025-05-07T20:23:12.4117500Z SHLVL=1 2025-05-07T20:23:12.4117698Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:12.4118019Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:12.4118467Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:12.4118826Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:12.4119071Z KERN_NAME=Linux 2025-05-07T20:23:12.4119302Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:12.4119710Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:12.4120134Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:12.4120416Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:12.4120660Z JOURNAL_STREAM=8:94509 2025-05-07T20:23:12.4120976Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:12.4121343Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:12.4121656Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:12.4121988Z GITHUB_BASE_REF=main 2025-05-07T20:23:12.4122212Z CI=true 2025-05-07T20:23:12.4122429Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:12.4122711Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:12.4122991Z GITHUB_ACTION_REF= 2025-05-07T20:23:12.4123244Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:12.4123854Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_b440ea1e-c694-438b-b960-cd28a028bf37 2025-05-07T20:23:12.4124440Z MACHINE_NAME=x86_64 2025-05-07T20:23:12.4124663Z _=/usr/bin/printenv 2025-05-07T20:23:12.4124810Z 2025-05-07T20:23:12.4124933Z ################################################################################ 2025-05-07T20:23:12.4125249Z [INFO] Print ldd version ... 2025-05-07T20:23:12.4125515Z + ldd --version 2025-05-07T20:23:12.4125644Z 2025-05-07T20:23:12.4125738Z ldd (GNU libc) 2.34 2025-05-07T20:23:12.4126003Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:12.4126447Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:12.4126978Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:12.4127422Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:12.4127643Z 2025-05-07T20:23:12.4127762Z ################################################################################ 2025-05-07T20:23:12.4128076Z [INFO] Print CPU info ... 
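The CPU census that follows (nproc, lscpu, /proc/cpuinfo) is the kind of data later build steps consult when picking a parallelism level; on this g5.4xlarge the arithmetic is 1 socket x 8 cores x 2 threads = 16 logical CPUs. A sketch of deriving both counts, assuming only standard coreutils/util-linux; the MAX_JOBS knob is hypothetical:

# Logical CPUs (SMT included) vs. physical cores; builds are often capped at
# physical cores to avoid oversubscribing the 2-way SMT reported below.
logical=$(nproc)
physical=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
export MAX_JOBS="${physical}"   # hypothetical knob a build script might honor
echo "logical=${logical} physical=${physical}"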
2025-05-07T20:23:12.4128320Z + nproc 2025-05-07T20:23:12.4128430Z 2025-05-07T20:23:12.4141832Z 16 2025-05-07T20:23:12.4143362Z 2025-05-07T20:23:12.4143591Z + lscpu 2025-05-07T20:23:12.4143715Z 2025-05-07T20:23:12.4251405Z Architecture: x86_64 2025-05-07T20:23:12.4251791Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.4252457Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4252860Z Byte Order: Little Endian 2025-05-07T20:23:12.4253182Z CPU(s): 16 2025-05-07T20:23:12.4253483Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.4253800Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.4254147Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.4254470Z CPU family: 23 2025-05-07T20:23:12.4254897Z Model: 49 2025-05-07T20:23:12.4255192Z Thread(s) per core: 2 2025-05-07T20:23:12.4255486Z Core(s) per socket: 8 2025-05-07T20:23:12.4255764Z Socket(s): 1 2025-05-07T20:23:12.4256049Z Stepping: 0 2025-05-07T20:23:12.4256358Z BogoMIPS: 5599.99 2025-05-07T20:23:12.4258400Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4260433Z Hypervisor vendor: KVM 2025-05-07T20:23:12.4260745Z Virtualization type: full 2025-05-07T20:23:12.4261080Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.4261525Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.4261889Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.4262289Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.4262617Z NUMA node(s): 1 2025-05-07T20:23:12.4262912Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.4263250Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.4263623Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.4263985Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.4264339Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.4264695Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.4265066Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.4265441Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.4266084Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.4266798Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.4267576Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.4268542Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.4269571Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.4270247Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.4270619Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.4270851Z 2025-05-07T20:23:12.4270953Z + cat /proc/cpuinfo 2025-05-07T20:23:12.4271092Z 2025-05-07T20:23:12.4271180Z processor : 0 2025-05-07T20:23:12.4271407Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4271661Z cpu family : 23 2025-05-07T20:23:12.4271868Z model : 49 
2025-05-07T20:23:12.4272081Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4272334Z stepping : 0 2025-05-07T20:23:12.4272548Z microcode : 0x830107f 2025-05-07T20:23:12.4272889Z cpu MHz : 3305.709 2025-05-07T20:23:12.4273114Z cache size : 512 KB 2025-05-07T20:23:12.4273331Z physical id : 0 2025-05-07T20:23:12.4273548Z siblings : 16 2025-05-07T20:23:12.4273753Z core id : 0 2025-05-07T20:23:12.4273953Z cpu cores : 8 2025-05-07T20:23:12.4274161Z apicid : 0 2025-05-07T20:23:12.4274367Z initial apicid : 0 2025-05-07T20:23:12.4274578Z fpu : yes 2025-05-07T20:23:12.4274784Z fpu_exception : yes 2025-05-07T20:23:12.4275009Z cpuid level : 13 2025-05-07T20:23:12.4275214Z wp : yes 2025-05-07T20:23:12.4277273Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4279476Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4279960Z bogomips : 5599.99 2025-05-07T20:23:12.4280183Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4280422Z clflush size : 64 2025-05-07T20:23:12.4280641Z cache_alignment : 64 2025-05-07T20:23:12.4280916Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4281236Z power management: 2025-05-07T20:23:12.4281376Z 2025-05-07T20:23:12.4281464Z processor : 1 2025-05-07T20:23:12.4281685Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4281922Z cpu family : 23 2025-05-07T20:23:12.4282172Z model : 49 2025-05-07T20:23:12.4282401Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4282653Z stepping : 0 2025-05-07T20:23:12.4282860Z microcode : 0x830107f 2025-05-07T20:23:12.4283091Z cpu MHz : 3297.096 2025-05-07T20:23:12.4283311Z cache size : 512 KB 2025-05-07T20:23:12.4283530Z physical id : 0 2025-05-07T20:23:12.4283742Z siblings : 16 2025-05-07T20:23:12.4283946Z core id : 1 2025-05-07T20:23:12.4284142Z cpu cores : 8 2025-05-07T20:23:12.4284345Z apicid : 2 2025-05-07T20:23:12.4284547Z initial apicid : 2 2025-05-07T20:23:12.4284755Z fpu : yes 2025-05-07T20:23:12.4284960Z fpu_exception : yes 2025-05-07T20:23:12.4285183Z cpuid level : 13 2025-05-07T20:23:12.4285388Z wp : yes 2025-05-07T20:23:12.4287314Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4289505Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4289996Z bogomips : 5599.99 2025-05-07T20:23:12.4290215Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4290458Z clflush size : 64 
2025-05-07T20:23:12.4290684Z cache_alignment : 64 2025-05-07T20:23:12.4290950Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4291268Z power management: 2025-05-07T20:23:12.4291409Z 2025-05-07T20:23:12.4291500Z processor : 2 2025-05-07T20:23:12.4291725Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4291975Z cpu family : 23 2025-05-07T20:23:12.4292223Z model : 49 2025-05-07T20:23:12.4292439Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4292678Z stepping : 0 2025-05-07T20:23:12.4292894Z microcode : 0x830107f 2025-05-07T20:23:12.4293126Z cpu MHz : 3301.203 2025-05-07T20:23:12.4293341Z cache size : 512 KB 2025-05-07T20:23:12.4293564Z physical id : 0 2025-05-07T20:23:12.4293782Z siblings : 16 2025-05-07T20:23:12.4294069Z core id : 2 2025-05-07T20:23:12.4294278Z cpu cores : 8 2025-05-07T20:23:12.4294483Z apicid : 4 2025-05-07T20:23:12.4294681Z initial apicid : 4 2025-05-07T20:23:12.4294897Z fpu : yes 2025-05-07T20:23:12.4295101Z fpu_exception : yes 2025-05-07T20:23:12.4295315Z cpuid level : 13 2025-05-07T20:23:12.4295529Z wp : yes 2025-05-07T20:23:12.4297563Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4299745Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4300236Z bogomips : 5599.99 2025-05-07T20:23:12.4300453Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4300695Z clflush size : 64 2025-05-07T20:23:12.4300919Z cache_alignment : 64 2025-05-07T20:23:12.4301338Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4301656Z power management: 2025-05-07T20:23:12.4301798Z 2025-05-07T20:23:12.4301888Z processor : 3 2025-05-07T20:23:12.4302110Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4302353Z cpu family : 23 2025-05-07T20:23:12.4302565Z model : 49 2025-05-07T20:23:12.4302777Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4303016Z stepping : 0 2025-05-07T20:23:12.4303228Z microcode : 0x830107f 2025-05-07T20:23:12.4303459Z cpu MHz : 3298.878 2025-05-07T20:23:12.4303672Z cache size : 512 KB 2025-05-07T20:23:12.4303892Z physical id : 0 2025-05-07T20:23:12.4304108Z siblings : 16 2025-05-07T20:23:12.4304307Z core id : 3 2025-05-07T20:23:12.4304515Z cpu cores : 8 2025-05-07T20:23:12.4304718Z apicid : 6 2025-05-07T20:23:12.4304915Z initial apicid : 6 2025-05-07T20:23:12.4305135Z fpu : yes 2025-05-07T20:23:12.4305350Z fpu_exception : yes 2025-05-07T20:23:12.4305574Z cpuid level : 13 2025-05-07T20:23:12.4305780Z wp : yes 2025-05-07T20:23:12.4307704Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4309933Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4310422Z bogomips : 5599.99 2025-05-07T20:23:12.4310650Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4310886Z clflush size : 64 2025-05-07T20:23:12.4311107Z cache_alignment : 64 2025-05-07T20:23:12.4311386Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4311702Z power management: 2025-05-07T20:23:12.4311837Z 2025-05-07T20:23:12.4321683Z processor : 4 2025-05-07T20:23:12.4321933Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4322224Z cpu family : 23 2025-05-07T20:23:12.4322457Z model : 49 2025-05-07T20:23:12.4322674Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4322931Z stepping : 0 2025-05-07T20:23:12.4323151Z microcode : 0x830107f 2025-05-07T20:23:12.4323380Z cpu MHz : 3306.286 2025-05-07T20:23:12.4323604Z cache size : 512 KB 2025-05-07T20:23:12.4323827Z physical id : 0 2025-05-07T20:23:12.4324036Z siblings : 16 2025-05-07T20:23:12.4324242Z core id : 4 2025-05-07T20:23:12.4324449Z cpu cores : 8 2025-05-07T20:23:12.4324658Z apicid : 8 2025-05-07T20:23:12.4325007Z initial apicid : 8 2025-05-07T20:23:12.4325231Z fpu : yes 2025-05-07T20:23:12.4325440Z fpu_exception : yes 2025-05-07T20:23:12.4325660Z cpuid level : 13 2025-05-07T20:23:12.4325875Z wp : yes 2025-05-07T20:23:12.4327919Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4330155Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4330636Z bogomips : 5599.99 2025-05-07T20:23:12.4330874Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4331121Z clflush size : 64 2025-05-07T20:23:12.4331335Z cache_alignment : 64 2025-05-07T20:23:12.4331615Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4331939Z power management: 2025-05-07T20:23:12.4332076Z 2025-05-07T20:23:12.4332170Z processor : 5 2025-05-07T20:23:12.4332384Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4332628Z cpu family : 23 2025-05-07T20:23:12.4332842Z model : 49 2025-05-07T20:23:12.4333049Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4333302Z stepping : 0 2025-05-07T20:23:12.4333517Z microcode : 0x830107f 2025-05-07T20:23:12.4333744Z cpu MHz : 3299.808 2025-05-07T20:23:12.4333970Z cache size : 512 KB 2025-05-07T20:23:12.4334191Z physical id : 0 2025-05-07T20:23:12.4334398Z siblings : 16 2025-05-07T20:23:12.4334603Z core id : 5 2025-05-07T20:23:12.4334808Z cpu cores : 8 2025-05-07T20:23:12.4335007Z apicid : 10 2025-05-07T20:23:12.4335216Z initial apicid : 10 2025-05-07T20:23:12.4335432Z fpu : yes 2025-05-07T20:23:12.4335637Z fpu_exception : yes 2025-05-07T20:23:12.4335861Z cpuid level : 13 2025-05-07T20:23:12.4336079Z wp : yes 2025-05-07T20:23:12.4337987Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4341046Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4341609Z bogomips : 5599.99 2025-05-07T20:23:12.4341836Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4342084Z clflush size : 64 2025-05-07T20:23:12.4342305Z cache_alignment : 64 2025-05-07T20:23:12.4342583Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4342900Z power management: 2025-05-07T20:23:12.4343036Z 2025-05-07T20:23:12.4343121Z processor : 6 2025-05-07T20:23:12.4343340Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4343586Z cpu family : 23 2025-05-07T20:23:12.4343790Z model : 49 2025-05-07T20:23:12.4343998Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4344243Z stepping : 0 2025-05-07T20:23:12.4344450Z microcode : 0x830107f 2025-05-07T20:23:12.4344685Z cpu MHz : 2788.357 2025-05-07T20:23:12.4344907Z cache size : 512 KB 2025-05-07T20:23:12.4345120Z physical id : 0 2025-05-07T20:23:12.4345333Z siblings : 16 2025-05-07T20:23:12.4345538Z core id : 6 2025-05-07T20:23:12.4345737Z cpu cores : 8 2025-05-07T20:23:12.4345946Z apicid : 12 2025-05-07T20:23:12.4346154Z initial apicid : 12 2025-05-07T20:23:12.4346369Z fpu : yes 2025-05-07T20:23:12.4346572Z fpu_exception : yes 2025-05-07T20:23:12.4346791Z cpuid level : 13 2025-05-07T20:23:12.4347159Z wp : yes 2025-05-07T20:23:12.4349211Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4351435Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4351920Z bogomips : 5599.99 2025-05-07T20:23:12.4352150Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4352424Z clflush size : 64 2025-05-07T20:23:12.4352655Z cache_alignment : 64 2025-05-07T20:23:12.4352937Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4353248Z power management: 2025-05-07T20:23:12.4353387Z 2025-05-07T20:23:12.4353471Z processor : 7 2025-05-07T20:23:12.4353689Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4353926Z cpu family : 23 2025-05-07T20:23:12.4354135Z model : 49 2025-05-07T20:23:12.4354348Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4354587Z stepping : 0 2025-05-07T20:23:12.4354798Z microcode : 0x830107f 2025-05-07T20:23:12.4355031Z cpu MHz : 3300.822 2025-05-07T20:23:12.4355248Z cache size : 512 KB 2025-05-07T20:23:12.4355468Z physical id : 0 2025-05-07T20:23:12.4355683Z siblings : 16 2025-05-07T20:23:12.4355881Z core id : 7 2025-05-07T20:23:12.4356078Z cpu cores : 8 2025-05-07T20:23:12.4356280Z apicid : 
14 2025-05-07T20:23:12.4356488Z initial apicid : 14 2025-05-07T20:23:12.4356744Z fpu : yes 2025-05-07T20:23:12.4356942Z fpu_exception : yes 2025-05-07T20:23:12.4357163Z cpuid level : 13 2025-05-07T20:23:12.4357378Z wp : yes 2025-05-07T20:23:12.4359298Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4361488Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4361974Z bogomips : 5599.99 2025-05-07T20:23:12.4362200Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4362445Z clflush size : 64 2025-05-07T20:23:12.4362660Z cache_alignment : 64 2025-05-07T20:23:12.4362937Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4363265Z power management: 2025-05-07T20:23:12.4363400Z 2025-05-07T20:23:12.4363485Z processor : 8 2025-05-07T20:23:12.4363705Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4363943Z cpu family : 23 2025-05-07T20:23:12.4364147Z model : 49 2025-05-07T20:23:12.4364361Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4364608Z stepping : 0 2025-05-07T20:23:12.4364813Z microcode : 0x830107f 2025-05-07T20:23:12.4365045Z cpu MHz : 3302.151 2025-05-07T20:23:12.4365263Z cache size : 512 KB 2025-05-07T20:23:12.4365472Z physical id : 0 2025-05-07T20:23:12.4365682Z siblings : 16 2025-05-07T20:23:12.4365890Z core id : 0 2025-05-07T20:23:12.4366088Z cpu cores : 8 2025-05-07T20:23:12.4366286Z apicid : 1 2025-05-07T20:23:12.4366484Z initial apicid : 1 2025-05-07T20:23:12.4366698Z fpu : yes 2025-05-07T20:23:12.4366891Z fpu_exception : yes 2025-05-07T20:23:12.4367109Z cpuid level : 13 2025-05-07T20:23:12.4367318Z wp : yes 2025-05-07T20:23:12.4369218Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4371768Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4372250Z bogomips : 5599.99 2025-05-07T20:23:12.4372471Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4372707Z clflush size : 64 2025-05-07T20:23:12.4372920Z cache_alignment : 64 2025-05-07T20:23:12.4373192Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4373502Z power management: 2025-05-07T20:23:12.4373635Z 2025-05-07T20:23:12.4373724Z processor : 9 2025-05-07T20:23:12.4373939Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4374179Z cpu family : 23 2025-05-07T20:23:12.4374381Z model : 49 2025-05-07T20:23:12.4374588Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4374829Z 
stepping : 0 2025-05-07T20:23:12.4375033Z microcode : 0x830107f 2025-05-07T20:23:12.4375261Z cpu MHz : 3295.300 2025-05-07T20:23:12.4375476Z cache size : 512 KB 2025-05-07T20:23:12.4375746Z physical id : 0 2025-05-07T20:23:12.4375954Z siblings : 16 2025-05-07T20:23:12.4376156Z core id : 1 2025-05-07T20:23:12.4376364Z cpu cores : 8 2025-05-07T20:23:12.4376564Z apicid : 3 2025-05-07T20:23:12.4376764Z initial apicid : 3 2025-05-07T20:23:12.4376976Z fpu : yes 2025-05-07T20:23:12.4377169Z fpu_exception : yes 2025-05-07T20:23:12.4377387Z cpuid level : 13 2025-05-07T20:23:12.4377596Z wp : yes 2025-05-07T20:23:12.4379494Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4381737Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4382225Z bogomips : 5599.99 2025-05-07T20:23:12.4382448Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4382679Z clflush size : 64 2025-05-07T20:23:12.4382896Z cache_alignment : 64 2025-05-07T20:23:12.4383166Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4383474Z power management: 2025-05-07T20:23:12.4383609Z 2025-05-07T20:23:12.4383694Z processor : 10 2025-05-07T20:23:12.4383914Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4384155Z cpu family : 23 2025-05-07T20:23:12.4384358Z model : 49 2025-05-07T20:23:12.4384563Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4384805Z stepping : 0 2025-05-07T20:23:12.4385009Z microcode : 0x830107f 2025-05-07T20:23:12.4385238Z cpu MHz : 3298.605 2025-05-07T20:23:12.4385455Z cache size : 512 KB 2025-05-07T20:23:12.4385665Z physical id : 0 2025-05-07T20:23:12.4385880Z siblings : 16 2025-05-07T20:23:12.4386082Z core id : 2 2025-05-07T20:23:12.4386277Z cpu cores : 8 2025-05-07T20:23:12.4386479Z apicid : 5 2025-05-07T20:23:12.4386682Z initial apicid : 5 2025-05-07T20:23:12.4386889Z fpu : yes 2025-05-07T20:23:12.4387086Z fpu_exception : yes 2025-05-07T20:23:12.4387305Z cpuid level : 13 2025-05-07T20:23:12.4387506Z wp : yes 2025-05-07T20:23:12.4389408Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4391716Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4392245Z bogomips : 5599.99 2025-05-07T20:23:12.4392577Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4392811Z clflush size : 64 2025-05-07T20:23:12.4393031Z cache_alignment : 64 2025-05-07T20:23:12.4393306Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:12.4393616Z power management: 2025-05-07T20:23:12.4393753Z 2025-05-07T20:23:12.4393841Z processor : 11 2025-05-07T20:23:12.4394062Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4394292Z cpu family : 23 2025-05-07T20:23:12.4394502Z model : 49 2025-05-07T20:23:12.4394710Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4394949Z stepping : 0 2025-05-07T20:23:12.4395163Z microcode : 0x830107f 2025-05-07T20:23:12.4395389Z cpu MHz : 3300.256 2025-05-07T20:23:12.4395595Z cache size : 512 KB 2025-05-07T20:23:12.4395812Z physical id : 0 2025-05-07T20:23:12.4396022Z siblings : 16 2025-05-07T20:23:12.4396216Z core id : 3 2025-05-07T20:23:12.4396421Z cpu cores : 8 2025-05-07T20:23:12.4396621Z apicid : 7 2025-05-07T20:23:12.4396820Z initial apicid : 7 2025-05-07T20:23:12.4397039Z fpu : yes 2025-05-07T20:23:12.4397239Z fpu_exception : yes 2025-05-07T20:23:12.4397452Z cpuid level : 13 2025-05-07T20:23:12.4397665Z wp : yes 2025-05-07T20:23:12.4399606Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4401781Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4402261Z bogomips : 5599.99 2025-05-07T20:23:12.4402476Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4402715Z clflush size : 64 2025-05-07T20:23:12.4402932Z cache_alignment : 64 2025-05-07T20:23:12.4403196Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4403513Z power management: 2025-05-07T20:23:12.4403644Z 2025-05-07T20:23:12.4403738Z processor : 12 2025-05-07T20:23:12.4403948Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4404187Z cpu family : 23 2025-05-07T20:23:12.4404395Z model : 49 2025-05-07T20:23:12.4404593Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4404844Z stepping : 0 2025-05-07T20:23:12.4405055Z microcode : 0x830107f 2025-05-07T20:23:12.4405279Z cpu MHz : 3300.699 2025-05-07T20:23:12.4405489Z cache size : 512 KB 2025-05-07T20:23:12.4405700Z physical id : 0 2025-05-07T20:23:12.4405909Z siblings : 16 2025-05-07T20:23:12.4406103Z core id : 4 2025-05-07T20:23:12.4406299Z cpu cores : 8 2025-05-07T20:23:12.4406498Z apicid : 9 2025-05-07T20:23:12.4406689Z initial apicid : 9 2025-05-07T20:23:12.4406902Z fpu : yes 2025-05-07T20:23:12.4407101Z fpu_exception : yes 2025-05-07T20:23:12.4407316Z cpuid level : 13 2025-05-07T20:23:12.4407525Z wp : yes 2025-05-07T20:23:12.4409426Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:12.4411724Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4412200Z bogomips : 5599.99 2025-05-07T20:23:12.4412417Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4412651Z clflush size : 64 2025-05-07T20:23:12.4412865Z cache_alignment : 64 2025-05-07T20:23:12.4413215Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4413531Z power management: 2025-05-07T20:23:12.4413661Z 2025-05-07T20:23:12.4413749Z processor : 13 2025-05-07T20:23:12.4413960Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4414196Z cpu family : 23 2025-05-07T20:23:12.4414401Z model : 49 2025-05-07T20:23:12.4414600Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4414842Z stepping : 0 2025-05-07T20:23:12.4415050Z microcode : 0x830107f 2025-05-07T20:23:12.4415278Z cpu MHz : 3299.588 2025-05-07T20:23:12.4415488Z cache size : 512 KB 2025-05-07T20:23:12.4415706Z physical id : 0 2025-05-07T20:23:12.4415911Z siblings : 16 2025-05-07T20:23:12.4416112Z core id : 5 2025-05-07T20:23:12.4416309Z cpu cores : 8 2025-05-07T20:23:12.4416508Z apicid : 11 2025-05-07T20:23:12.4416709Z initial apicid : 11 2025-05-07T20:23:12.4416920Z fpu : yes 2025-05-07T20:23:12.4417113Z fpu_exception : yes 2025-05-07T20:23:12.4417331Z cpuid level : 13 2025-05-07T20:23:12.4417539Z wp : yes 2025-05-07T20:23:12.4419486Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4421758Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4422282Z bogomips : 5599.99 2025-05-07T20:23:12.4422501Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4422737Z clflush size : 64 2025-05-07T20:23:12.4422947Z cache_alignment : 64 2025-05-07T20:23:12.4423217Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4423533Z power management: 2025-05-07T20:23:12.4423667Z 2025-05-07T20:23:12.4423749Z processor : 14 2025-05-07T20:23:12.4423963Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4424199Z cpu family : 23 2025-05-07T20:23:12.4424401Z model : 49 2025-05-07T20:23:12.4424608Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4424852Z stepping : 0 2025-05-07T20:23:12.4425055Z microcode : 0x830107f 2025-05-07T20:23:12.4425279Z cpu MHz : 3302.929 2025-05-07T20:23:12.4425499Z cache size : 512 KB 2025-05-07T20:23:12.4425709Z physical id : 0 2025-05-07T20:23:12.4425920Z siblings : 16 2025-05-07T20:23:12.4426119Z core id : 6 2025-05-07T20:23:12.4426311Z cpu cores : 8 2025-05-07T20:23:12.4426510Z apicid : 13 2025-05-07T20:23:12.4426715Z initial apicid : 13 2025-05-07T20:23:12.4426924Z fpu : yes 2025-05-07T20:23:12.4427121Z fpu_exception : yes 2025-05-07T20:23:12.4427340Z cpuid level : 13 2025-05-07T20:23:12.4427541Z wp : yes 2025-05-07T20:23:12.4429487Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4431786Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4432263Z bogomips : 5599.99 2025-05-07T20:23:12.4432482Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4432711Z clflush size : 64 2025-05-07T20:23:12.4432936Z cache_alignment : 64 2025-05-07T20:23:12.4433203Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4433520Z power management: 2025-05-07T20:23:12.4433651Z 2025-05-07T20:23:12.4433834Z processor : 15 2025-05-07T20:23:12.4434053Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.4434296Z cpu family : 23 2025-05-07T20:23:12.4434507Z model : 49 2025-05-07T20:23:12.4434717Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.4434959Z stepping : 0 2025-05-07T20:23:12.4435171Z microcode : 0x830107f 2025-05-07T20:23:12.4435401Z cpu MHz : 3299.879 2025-05-07T20:23:12.4435608Z cache size : 512 KB 2025-05-07T20:23:12.4435827Z physical id : 0 2025-05-07T20:23:12.4436045Z siblings : 16 2025-05-07T20:23:12.4436244Z core id : 7 2025-05-07T20:23:12.4436445Z cpu cores : 8 2025-05-07T20:23:12.4436648Z apicid : 15 2025-05-07T20:23:12.4436849Z initial apicid : 15 2025-05-07T20:23:12.4437067Z fpu : yes 2025-05-07T20:23:12.4437270Z fpu_exception : yes 2025-05-07T20:23:12.4437486Z cpuid level : 13 2025-05-07T20:23:12.4437695Z wp : yes 2025-05-07T20:23:12.4439610Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.4443013Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.4443500Z bogomips : 5599.99 2025-05-07T20:23:12.4443723Z TLB size : 3072 4K pages 2025-05-07T20:23:12.4443961Z clflush size : 64 2025-05-07T20:23:12.4444171Z cache_alignment : 64 2025-05-07T20:23:12.4444442Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.4444755Z power management: 2025-05-07T20:23:12.4444885Z 2025-05-07T20:23:12.4444889Z 2025-05-07T20:23:12.4445021Z ################################################################################ 2025-05-07T20:23:12.4445329Z [INFO] Print PCI info ... 2025-05-07T20:23:12.4445578Z + lspci -v 2025-05-07T20:23:12.4445694Z 2025-05-07T20:23:12.4445904Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:12.4446286Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:12.4446599Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:12.4446814Z 2025-05-07T20:23:12.4447015Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:12.4447399Z Physical Slot: 1 2025-05-07T20:23:12.4447645Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.4447850Z 2025-05-07T20:23:12.4448093Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:12.4448528Z Physical Slot: 1 2025-05-07T20:23:12.4448787Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:12.4449010Z 2025-05-07T20:23:12.4449275Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:12.4449726Z Physical Slot: 3 2025-05-07T20:23:12.4449966Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.4450310Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.4450661Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:12.4450889Z 2025-05-07T20:23:12.4451187Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.4451855Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:12.4452187Z Physical Slot: 4 2025-05-07T20:23:12.4452450Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:12.4452833Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.4453180Z Capabilities: 2025-05-07T20:23:12.4453444Z Kernel driver in use: nvme 2025-05-07T20:23:12.4453610Z 2025-05-07T20:23:12.4453937Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.4454418Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.4454754Z Physical Slot: 5 2025-05-07T20:23:12.4455001Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.4455363Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.4455738Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.4456061Z Capabilities: 2025-05-07T20:23:12.4456332Z Kernel driver in use: ena 2025-05-07T20:23:12.4456571Z Kernel modules: ena 2025-05-07T20:23:12.4456717Z 2025-05-07T20:23:12.4456886Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:12.4457265Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:12.4457560Z Physical Slot: 30 2025-05-07T20:23:12.4457814Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:12.4458194Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:12.4458587Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:12.4458965Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:12.4459298Z Capabilities: 2025-05-07T20:23:12.4459567Z Kernel driver in use: nvidia 2025-05-07T20:23:12.4459817Z Kernel modules: nvidia 2025-05-07T20:23:12.4459971Z 2025-05-07T20:23:12.4460269Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.4460780Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:12.4461066Z Physical Slot: 31 2025-05-07T20:23:12.4461371Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.4461728Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.4462104Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:12.4462437Z Capabilities: 2025-05-07T20:23:12.4462708Z Kernel driver in use: nvme 2025-05-07T20:23:12.4462869Z 2025-05-07T20:23:12.4462873Z 2025-05-07T20:23:12.4463004Z ################################################################################ 2025-05-07T20:23:12.4463328Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:12.4471598Z + uname -a 2025-05-07T20:23:12.4471733Z 2025-05-07T20:23:12.4472131Z Linux ip-10-0-45-1.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:12.4472626Z 2025-05-07T20:23:12.4472709Z + uname -m 2025-05-07T20:23:12.4472831Z 2025-05-07T20:23:12.4472912Z x86_64 2025-05-07T20:23:12.4473020Z 2025-05-07T20:23:12.4473106Z + cat /proc/version 2025-05-07T20:23:12.4473246Z 2025-05-07T20:23:12.4473776Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:12.4474405Z 2025-05-07T20:23:12.4474494Z + cat /etc/os-release 2025-05-07T20:23:12.4474637Z 2025-05-07T20:23:12.4474753Z NAME="Amazon Linux" 2025-05-07T20:23:12.4474967Z VERSION="2023" 2025-05-07T20:23:12.4475173Z ID="amzn" 2025-05-07T20:23:12.4475366Z ID_LIKE="fedora" 2025-05-07T20:23:12.4475566Z VERSION_ID="2023" 2025-05-07T20:23:12.4475803Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:12.4476090Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:12.4476373Z ANSI_COLOR="0;33" 2025-05-07T20:23:12.4476626Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:12.4477140Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:12.4477581Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:12.4477996Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:12.4478439Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:12.4478808Z VENDOR_NAME="AWS" 2025-05-07T20:23:12.4479044Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:12.4479335Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:12.4479487Z 2025-05-07T20:23:12.4479726Z ################################################################################ 2025-05-07T20:23:12.4480036Z # Print EC2 Instance Info 2025-05-07T20:23:12.4480274Z # 2025-05-07T20:23:12.4480478Z # [2025-05-07T20:23:12.446Z] + print_ec2_info 2025-05-07T20:23:12.4480785Z ################################################################################ 2025-05-07T20:23:12.4481000Z 2025-05-07T20:23:12.4591636Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:12.4724525Z instance-id: i-011bf0f995071f8f9 2025-05-07T20:23:12.4841318Z instance-type: g5.4xlarge 2025-05-07T20:23:12.4880115Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:12.4880480Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:12.4889528Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:12.4889887Z env: 2025-05-07T20:23:12.4890112Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:12.4890425Z BUILD_ENV: build_binary 2025-05-07T20:23:12.4890682Z BUILD_TARGET: genai 2025-05-07T20:23:12.4890913Z BUILD_VARIANT: cuda 2025-05-07T20:23:12.4891160Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:12.4891429Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:12.4891729Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:12.4892094Z ##[endgroup] 2025-05-07T20:23:12.8220722Z ################################################################################ 2025-05-07T20:23:12.8221271Z [INFO] Printing general display info ... 2025-05-07T20:23:12.8250575Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:12.9335574Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:12.9345705Z /usr/bin/sudo 2025-05-07T20:23:12.9356125Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:12.9366221Z /usr/bin/yum 2025-05-07T20:23:12.9367827Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:12.9388651Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.3827348Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:13.4606057Z ================================================================================ 2025-05-07T20:23:13.4606697Z WARNING: 2025-05-07T20:23:13.4607157Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:13.4607593Z 2025-05-07T20:23:13.4607766Z Available Versions: 2025-05-07T20:23:13.4608040Z 2025-05-07T20:23:13.4608220Z Version 2023.7.20250331: 2025-05-07T20:23:13.4608789Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:13.4609300Z 2025-05-07T20:23:13.4609548Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:13.4609944Z 2025-05-07T20:23:13.4610115Z Release notes: 2025-05-07T20:23:13.4610851Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:13.4611535Z 2025-05-07T20:23:13.4611704Z Version 2023.7.20250414: 2025-05-07T20:23:13.4612269Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:13.4612725Z 2025-05-07T20:23:13.4612950Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:13.4613264Z 2025-05-07T20:23:13.4613352Z Release notes: 2025-05-07T20:23:13.4613755Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:13.4614118Z 2025-05-07T20:23:13.4614218Z Version 2023.7.20250428: 2025-05-07T20:23:13.4614533Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:13.4614781Z 2025-05-07T20:23:13.4615156Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:13.4615379Z 2025-05-07T20:23:13.4615469Z Release notes: 2025-05-07T20:23:13.4615869Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:13.4616229Z 2025-05-07T20:23:13.4616342Z ================================================================================ 2025-05-07T20:23:13.5769021Z Dependencies resolved. 
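Both the pypi.org reachability probe and the yum update above run through the same retry wrapper, which is where the [EXEC] [ATTEMPT 0/3] prefix comes from. A hedged reconstruction of that pattern, before the resolved transaction prints below; the helper name, retry count, and backoff are assumptions, since the real helper lives in .github/scripts/setup_env.bash:

# exec_with_retries: hypothetical stand-in for the wrapper behind the
# "[EXEC] [ATTEMPT i/3] + <command>" lines in this log (0-indexed attempts).
exec_with_retries () {
  local max=3 i
  for ((i = 0; i <= max; i++)); do
    echo "[EXEC] [ATTEMPT ${i}/${max}] + $*"
    "$@" && return 0
    sleep 2  # back off briefly before the next attempt
  done
  return 1
}

exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null \
  && echo "[CHECK] Network does not appear to be blocked."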
2025-05-07T20:23:13.6053771Z ================================================================================ 2025-05-07T20:23:13.6054234Z Package Arch Version Repository Size 2025-05-07T20:23:13.6054739Z ================================================================================ 2025-05-07T20:23:13.6055063Z Upgrading: 2025-05-07T20:23:13.6055427Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:13.6056022Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:13.6056383Z 2025-05-07T20:23:13.6056705Z Transaction Summary 2025-05-07T20:23:13.6056968Z ================================================================================ 2025-05-07T20:23:13.6057285Z Upgrade 2 Packages 2025-05-07T20:23:13.6057442Z 2025-05-07T20:23:13.6057595Z Total download size: 6.9 M 2025-05-07T20:23:13.6058593Z Downloading Packages: 2025-05-07T20:23:13.6510384Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 28 MB/s | 1.2 MB 00:00 2025-05-07T20:23:13.7053967Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 58 MB/s | 5.7 MB 00:00 2025-05-07T20:23:13.7062625Z -------------------------------------------------------------------------------- 2025-05-07T20:23:13.7065553Z Total 69 MB/s | 6.9 MB 00:00 2025-05-07T20:23:13.7068159Z Running transaction check 2025-05-07T20:23:13.7166979Z Transaction check succeeded. 2025-05-07T20:23:13.7167917Z Running transaction test 2025-05-07T20:23:13.7463614Z Transaction test succeeded. 2025-05-07T20:23:13.7466488Z Running transaction 2025-05-07T20:23:14.2979563Z Preparing : 1/1 2025-05-07T20:23:14.4030757Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.4051349Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.4253274Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.4254505Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.4358596Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.4380149Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.5801523Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:14.5802115Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.5802687Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:14.5803227Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:14.7883713Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.7884098Z 2025-05-07T20:23:14.7884192Z Upgraded: 2025-05-07T20:23:14.7884550Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:14.7885135Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:14.7885474Z 2025-05-07T20:23:14.7885570Z Complete! 2025-05-07T20:23:14.8316142Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:14.8337529Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:15.2752854Z Last metadata expiration check: 0:00:11 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:15.2994508Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:15.3396062Z Dependencies resolved.
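The nvidia-container-toolkit upgrade above is what makes GPU_FLAG in this job's environment (--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all) meaningful: the toolkit is the runtime shim that lets Docker hand the A10G to a container. A sketch of how a later test step might consume it; the image tag is illustrative and not taken from this log:

# Smoke-test GPU passthrough the same way the eventual test container uses it.
# GPU_FLAG is left unquoted on purpose so it word-splits into separate args.
docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi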
2025-05-07T20:23:15.3574454Z ================================================================================ 2025-05-07T20:23:15.3575483Z Package Architecture Version Repository Size 2025-05-07T20:23:15.3575957Z ================================================================================ 2025-05-07T20:23:15.3576272Z Installing: 2025-05-07T20:23:15.3576572Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:15.3576851Z 2025-05-07T20:23:15.3576952Z Transaction Summary 2025-05-07T20:23:15.3577200Z ================================================================================ 2025-05-07T20:23:15.3577521Z Install 1 Package 2025-05-07T20:23:15.3577666Z 2025-05-07T20:23:15.3577772Z Total download size: 319 k 2025-05-07T20:23:15.3578035Z Installed size: 837 k 2025-05-07T20:23:15.3578793Z Downloading Packages: 2025-05-07T20:23:15.4276576Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 7.7 MB/s | 319 kB 00:00 2025-05-07T20:23:15.4282090Z -------------------------------------------------------------------------------- 2025-05-07T20:23:15.4284827Z Total 4.4 MB/s | 319 kB 00:00 2025-05-07T20:23:15.4441511Z Running transaction check 2025-05-07T20:23:15.4496618Z Transaction check succeeded. 2025-05-07T20:23:15.4497377Z Running transaction test 2025-05-07T20:23:15.4953241Z Transaction test succeeded. 2025-05-07T20:23:15.4956483Z Running transaction 2025-05-07T20:23:15.5974914Z Preparing : 1/1 2025-05-07T20:23:15.6476704Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.8293862Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:15.9859691Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.9860031Z 2025-05-07T20:23:15.9860128Z Installed: 2025-05-07T20:23:15.9860443Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:15.9860746Z 2025-05-07T20:23:15.9860831Z Complete! 2025-05-07T20:23:16.0300256Z + hostname 2025-05-07T20:23:16.0300964Z 2025-05-07T20:23:16.0313015Z ip-10-0-45-1.ec2.internal 2025-05-07T20:23:16.0314277Z 2025-05-07T20:23:16.0314555Z + sudo lshw -C display 2025-05-07T20:23:16.0314723Z 2025-05-07T20:23:16.5262134Z *-display:0 UNCLAIMED 2025-05-07T20:23:16.5262676Z description: VGA compatible controller 2025-05-07T20:23:16.5263199Z product: Amazon.com, Inc. 2025-05-07T20:23:16.5263674Z vendor: Amazon.com, Inc.
2025-05-07T20:23:16.5264085Z physical id: 3 2025-05-07T20:23:16.5264454Z bus info: pci@0000:00:03.0 2025-05-07T20:23:16.5264862Z version: 00 2025-05-07T20:23:16.5265194Z width: 32 bits 2025-05-07T20:23:16.5265543Z clock: 33MHz 2025-05-07T20:23:16.5265919Z capabilities: vga_controller bus_master 2025-05-07T20:23:16.5266419Z configuration: latency=0 2025-05-07T20:23:16.5266881Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:16.5267340Z *-display:1 2025-05-07T20:23:16.5267656Z description: 3D controller 2025-05-07T20:23:16.5268075Z product: GA102GL [A10G] 2025-05-07T20:23:16.5268475Z vendor: NVIDIA Corporation 2025-05-07T20:23:16.5268864Z physical id: 1e 2025-05-07T20:23:16.5269205Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:16.5269576Z version: a1 2025-05-07T20:23:16.5269897Z width: 64 bits 2025-05-07T20:23:16.5270231Z clock: 33MHz 2025-05-07T20:23:16.5270655Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:16.5271156Z configuration: driver=nvidia latency=0 2025-05-07T20:23:16.5271998Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:16.5305376Z 2025-05-07T20:23:16.5305614Z ################################################################################ 2025-05-07T20:23:16.5305963Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:16.5436752Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:16.5611729Z Wed May 7 20:23:16 2025 2025-05-07T20:23:16.5612543Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.5613439Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:16.5613977Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.5614487Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:16.5615026Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:16.5615458Z | | | MIG M. | 2025-05-07T20:23:16.5615805Z |=========================================+========================+======================| 2025-05-07T20:23:16.5694305Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:16.5695141Z | 0% 33C P0 60W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:16.5695530Z | | | N/A | 2025-05-07T20:23:16.5695937Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.5696347Z 2025-05-07T20:23:16.5696745Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.5697176Z | Processes: | 2025-05-07T20:23:16.5697623Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:16.5698041Z | ID ID Usage | 2025-05-07T20:23:16.5698404Z |=========================================================================================| 2025-05-07T20:23:16.5699039Z | No running processes found | 2025-05-07T20:23:16.5699518Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.7090005Z ################################################################################ 2025-05-07T20:23:16.7090382Z [INFO] Printing AMD GPU info ... 
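The checks that follow mirror the NVIDIA ones above: the script probes for ROCm tooling and reports its absence instead of failing, since this is a CUDA runner. The probe pattern reduces to something like this minimal sketch; the function name is hypothetical:

# Report which GPU stack is present; on this runner only nvidia-smi resolves,
# so the rocminfo/rocm-smi lookups below come back empty.
detect_gpu_stack () {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo cuda
  elif command -v rocminfo >/dev/null 2>&1 || command -v rocm-smi >/dev/null 2>&1; then
    echo rocm
  else
    echo cpu
  fi
}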
2025-05-07T20:23:16.7233045Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.7233806Z [CHECK] rocminfo not found 2025-05-07T20:23:16.7243283Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.7244290Z [CHECK] rocm-smi not found 2025-05-07T20:23:16.7289689Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.7290128Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.7301708Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:16.7302074Z env: 2025-05-07T20:23:16.7302306Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:16.7302606Z BUILD_ENV: build_binary 2025-05-07T20:23:16.7302856Z BUILD_TARGET: genai 2025-05-07T20:23:16.7303089Z BUILD_VARIANT: cuda 2025-05-07T20:23:16.7303324Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:16.7303591Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:16.7303905Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:16.7304231Z ##[endgroup] 2025-05-07T20:23:17.0640591Z ################################################################################ 2025-05-07T20:23:17.0641043Z # Setup Miniconda 2025-05-07T20:23:17.0641271Z # 2025-05-07T20:23:17.0657071Z # [2025-05-07T20:23:17.065Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:17.0657505Z ################################################################################ 2025-05-07T20:23:17.0657732Z 2025-05-07T20:23:17.0673594Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:17.1556148Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:17.1556530Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:17.1556738Z 2025-05-07T20:23:17.1574022Z 2025-05-07T20:23:17.1574429Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:17.1595159Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:18.2520840Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:18.2521226Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:18.2521482Z 2025-05-07T20:23:18.2666231Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:18.7180506Z Unpacking payload ... 2025-05-07T20:23:19.2355301Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:20.0341783Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:22.1371922Z 2025-05-07T20:23:22.1372302Z Installing base environment... 2025-05-07T20:23:22.1372536Z 2025-05-07T20:23:23.2092702Z Preparing transaction: ...working... done 2025-05-07T20:23:26.1758876Z Executing transaction: ...working... done 2025-05-07T20:23:26.8379664Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:26.9270312Z installation finished. 2025-05-07T20:23:26.9277087Z 2025-05-07T20:23:26.9277397Z + rm -f miniconda.sh 2025-05-07T20:23:26.9277588Z 2025-05-07T20:23:26.9583898Z 2025-05-07T20:23:26.9584289Z [SETUP] Reloading the bash configuration ... 
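Note: the step below runs conda init and then re-sources ~/.bashrc. The job's shell is bash --noprofile --norc, so the hook that defines the conda shell function is never loaded automatically. A minimal sketch of the same effect, using the conda.sh path shown in the output below:

    # Make `conda activate` usable in a non-interactive CI shell.
    # Equivalent to re-reading ~/.bashrc after `conda init bash`.
    source /home/ec2-user/miniconda/etc/profile.d/conda.sh
    conda activate base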
2025-05-07T20:23:26.9584657Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:27.3233210Z no change     /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:27.3233783Z no change     /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:27.3234286Z no change     /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:27.3234827Z no change     /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:27.3235281Z no change     /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:27.3235688Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:27.3236133Z no change     /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:27.3236572Z no change     /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:27.3237036Z no change     /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:27.3237847Z no change     /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:27.3238375Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:27.3238745Z modified      /home/ec2-user/.bashrc
2025-05-07T20:23:27.3239142Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.3910423Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:28.2291716Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:28.2316019Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:41.6993674Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:43.3069946Z Solving environment: done
2025-05-07T20:23:43.4039635Z ## Package Plan ##
2025-05-07T20:23:43.4040038Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.4040616Z   added / updated specs:
2025-05-07T20:23:43.4040894Z     - conda-libmamba-solver
2025-05-07T20:23:43.4041148Z     - libarchive
2025-05-07T20:23:43.4041381Z     - libmamba
2025-05-07T20:23:43.4041599Z     - libmambapy
2025-05-07T20:23:43.4041879Z The following packages will be downloaded:
2025-05-07T20:23:43.4042222Z     package                     |            build
2025-05-07T20:23:43.4042552Z     ----------------------------|-----------------
2025-05-07T20:23:43.4042977Z     ca-certificates-2025.4.26   |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:23:43.4043451Z     certifi-2025.4.26           |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:23:43.4043894Z     conda-25.3.1                |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:23:43.4044376Z     conda-libmamba-solver-25.4.0|     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:23:43.4044825Z     ------------------------------------------------------------
2025-05-07T20:23:43.4045174Z                                            Total:         1.4 MB
2025-05-07T20:23:43.4045506Z The following packages will be UPDATED:
2025-05-07T20:23:43.4049893Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.4050691Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.4051320Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.4051965Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.4052770Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.4053415Z Downloading and Extracting Packages: ...working...
2025-05-07T20:23:43.4693338Z certifi-2025.4.26    | 154 KB    | ########## | 100%
2025-05-07T20:23:43.4781676Z ca-certificates-2025 | 149 KB    | ########## | 100%
2025-05-07T20:23:43.4795952Z conda-libmamba-solve | 41 KB     | ########## | 100%
2025-05-07T20:23:43.6101924Z conda-25.3.1         | 1.1 MB    | ########## | 100%
2025-05-07T20:23:43.6103629Z done
2025-05-07T20:23:43.7107091Z Preparing transaction: done
2025-05-07T20:23:43.8110898Z Verifying transaction: done
2025-05-07T20:23:45.2133699Z Executing transaction: done
2025-05-07T20:23:46.9785618Z [SETUP] Updating Miniconda base packages ...
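Note: the transaction above installs conda-libmamba-solver with --solver=classic, sidestepping a bootstrap problem — if the libmamba plugin is missing or broken, a solve that defaults to it cannot even install its replacement. Once it lands, later solves use libmamba, as the conda info output further down confirms. The two-phase pattern, as run above:

    # Phase 1: use the built-in classic solver to install the libmamba solver.
    conda install --solver=classic -c conda-forge --override-channels -y \
        conda-libmamba-solver libmamba libmambapy libarchive
    # Phase 2: from here on, plain `conda install ...` solves with libmamba.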
2025-05-07T20:23:46.9814547Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:47.9177302Z Channels:
2025-05-07T20:23:47.9177585Z  - defaults
2025-05-07T20:23:47.9177819Z Platform: linux-64
2025-05-07T20:23:49.1386550Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.2578868Z Solving environment: done
2025-05-07T20:23:49.2579298Z Channels:
2025-05-07T20:23:49.2579615Z  - defaults
2025-05-07T20:23:49.2579615Z Platform: linux-64
2025-05-07T20:23:49.5502043Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.7607914Z Solving environment: done
2025-05-07T20:23:49.9116590Z ## Package Plan ##
2025-05-07T20:23:49.9116906Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:49.9117261Z   added / updated specs:
2025-05-07T20:23:49.9117512Z     - conda
2025-05-07T20:23:49.9117764Z The following packages will be downloaded:
2025-05-07T20:23:49.9118112Z     package                     |            build
2025-05-07T20:23:49.9118432Z     ----------------------------|-----------------
2025-05-07T20:23:49.9118789Z     pip-25.1                    |     pyhc872135_2         1.3 MB
2025-05-07T20:23:49.9119178Z     tzdata-2025b                |       h04d1e81_0         116 KB
2025-05-07T20:23:49.9119551Z     ------------------------------------------------------------
2025-05-07T20:23:49.9119908Z                                            Total:         1.4 MB
2025-05-07T20:23:49.9121001Z The following packages will be UPDATED:
2025-05-07T20:23:49.9121521Z   pip                pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:49.9122038Z   tzdata             2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:49.9122444Z Downloading and Extracting Packages: ...working...
2025-05-07T20:23:49.9677646Z tzdata-2025b         | 116 KB    | ########## | 100%
2025-05-07T20:23:50.1585906Z pip-25.1             | 1.3 MB    | ########## | 100%
2025-05-07T20:23:50.1664677Z done
2025-05-07T20:23:50.2668866Z Preparing transaction: done
2025-05-07T20:23:50.3671715Z Verifying transaction: done
2025-05-07T20:23:52.3775088Z Executing transaction: done
2025-05-07T20:23:52.9892299Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:52.9897233Z + conda clean --packages --tarball -y
2025-05-07T20:23:53.9911861Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:53.9912235Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.0600388Z + conda clean --all -y
2025-05-07T20:23:54.6097727Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.6098123Z Will remove 1 index cache(s).
2025-05-07T20:23:54.6098420Z There are no unused package(s) to remove.
2025-05-07T20:23:54.6098746Z There are no tempfile(s) to remove. 2025-05-07T20:23:54.6099051Z There are no logfile(s) to remove. 2025-05-07T20:23:54.6824919Z 2025-05-07T20:23:54.6829599Z + conda info 2025-05-07T20:23:54.6829952Z 2025-05-07T20:23:55.4276801Z 2025-05-07T20:23:55.4277409Z active environment : base 2025-05-07T20:23:55.4277781Z active env location : /home/ec2-user/miniconda 2025-05-07T20:23:55.4278118Z shell level : 1 2025-05-07T20:23:55.4278405Z user config file : /home/ec2-user/.condarc 2025-05-07T20:23:55.4278795Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:23:55.4279171Z conda version : 25.3.1 2025-05-07T20:23:55.4279482Z conda-build version : not installed 2025-05-07T20:23:55.4279783Z python version : 3.13.2.final.0 2025-05-07T20:23:55.4280087Z solver : libmamba (default) 2025-05-07T20:23:55.4280415Z virtual packages : __archspec=1=zen2 2025-05-07T20:23:55.4280726Z __conda=25.3.1=0 2025-05-07T20:23:55.4281015Z __cuda=12.8=0 2025-05-07T20:23:55.4281298Z __glibc=2.34=0 2025-05-07T20:23:55.4281582Z __linux=6.1.130=0 2025-05-07T20:23:55.4281857Z __unix=0=0 2025-05-07T20:23:55.4282197Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:23:55.4282606Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:23:55.4282950Z conda av metadata url : None 2025-05-07T20:23:55.4283325Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:23:55.4284149Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:23:55.4284548Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:23:55.4284922Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:23:55.4285303Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:23:55.4285650Z /home/ec2-user/.conda/pkgs 2025-05-07T20:23:55.4285990Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:23:55.4286332Z /home/ec2-user/.conda/envs 2025-05-07T20:23:55.4286641Z platform : linux-64 2025-05-07T20:23:55.4287466Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:23:55.4288279Z UID:GID : 1000:1000 2025-05-07T20:23:55.4288561Z netrc file : None 2025-05-07T20:23:55.4288833Z offline mode : False 2025-05-07T20:23:55.4289003Z 2025-05-07T20:23:55.4922965Z 2025-05-07T20:23:55.4923448Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:23:55.4924189Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_d0c2c61f-7224-4a35-a81e-b4e14e84e54d ... 2025-05-07T20:23:55.4925704Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:23:55.5078854Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.9 2025-05-07T20:23:55.5079344Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.9 2025-05-07T20:23:55.5097134Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:55.5097493Z env: 2025-05-07T20:23:55.5097719Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:55.5098028Z BUILD_ENV: build_binary 2025-05-07T20:23:55.5098277Z BUILD_TARGET: genai 2025-05-07T20:23:55.5098508Z BUILD_VARIANT: cuda 2025-05-07T20:23:55.5098742Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:55.5099002Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:55.5099310Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:55.5099643Z ##[endgroup] 2025-05-07T20:23:55.8465834Z ################################################################################ 2025-05-07T20:23:55.8466189Z # Create Conda Environment 2025-05-07T20:23:55.8466441Z # 2025-05-07T20:23:55.8481339Z # [2025-05-07T20:23:55.847Z] + create_conda_environment build_binary 3.9 2025-05-07T20:23:55.8481806Z ################################################################################ 2025-05-07T20:23:55.8491154Z 2025-05-07T20:23:55.8496149Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:55.9370489Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:55.9370863Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:23:55.9371192Z + conda info --envs 2025-05-07T20:23:55.9371332Z 2025-05-07T20:23:56.6833877Z 2025-05-07T20:23:56.6834617Z # conda environments: 2025-05-07T20:23:56.6834916Z # 2025-05-07T20:23:56.6835151Z base /home/ec2-user/miniconda 2025-05-07T20:23:56.6835385Z 2025-05-07T20:23:56.7517908Z 2025-05-07T20:23:56.7518469Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:23:58.3886359Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:23:58.3886664Z 2025-05-07T20:23:58.3900035Z 2025-05-07T20:23:58.3909367Z [SETUP] Creating new Conda environment (Python 3.9) ... 
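Note: every `[EXEC] [ATTEMPT 0/3]` line comes from a retry wrapper that the prelude puts around network-bound commands. A minimal sketch of such a wrapper, assuming up to three retries and a fixed sleep — the helper name, delay, and exact behavior of the real setup_env.bash are assumptions here:

    # Hypothetical retry helper matching the [EXEC] [ATTEMPT i/3] log format.
    exec_with_retries () {
      local max_retries=3
      local i
      for i in $(seq 0 "${max_retries}"); do
        echo "[EXEC] [ATTEMPT ${i}/${max_retries}] + $*"
        "$@" && return 0
        sleep 10
      done
      return 1
    }

    exec_with_retries conda create -y -n build_binary python=3.9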
2025-05-07T20:23:58.3931716Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.9 2025-05-07T20:23:59.1452066Z Channels: 2025-05-07T20:23:59.1452388Z - defaults 2025-05-07T20:23:59.1452683Z Platform: linux-64 2025-05-07T20:24:00.4747875Z Collecting package metadata (repodata.json): - \ | / - \ | / done 2025-05-07T20:24:00.5753615Z Solving environment: \ done 2025-05-07T20:24:00.6039067Z 2025-05-07T20:24:00.6039226Z ## Package Plan ## 2025-05-07T20:24:00.6039384Z 2025-05-07T20:24:00.6039594Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:00.6039907Z 2025-05-07T20:24:00.6040008Z added / updated specs: 2025-05-07T20:24:00.6040576Z - python=3.9 2025-05-07T20:24:00.6040827Z 2025-05-07T20:24:00.6040833Z 2025-05-07T20:24:00.6041001Z The following packages will be downloaded: 2025-05-07T20:24:00.6041288Z 2025-05-07T20:24:00.6041459Z package | build 2025-05-07T20:24:00.6041884Z ---------------------------|----------------- 2025-05-07T20:24:00.6042257Z _libgcc_mutex-0.1 | main 3 KB 2025-05-07T20:24:00.6042659Z _openmp_mutex-5.1 | 1_gnu 21 KB 2025-05-07T20:24:00.6043086Z ca-certificates-2025.2.25 | h06a4308_0 129 KB 2025-05-07T20:24:00.6043503Z python-3.9.21 | he870216_1 25.1 MB 2025-05-07T20:24:00.6043904Z setuptools-78.1.1 | py39h06a4308_0 1.7 MB 2025-05-07T20:24:00.6044303Z wheel-0.45.1 | py39h06a4308_0 114 KB 2025-05-07T20:24:00.6044674Z ------------------------------------------------------------ 2025-05-07T20:24:00.6045017Z Total: 27.1 MB 2025-05-07T20:24:00.6045563Z 2025-05-07T20:24:00.6045693Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:00.6045929Z 2025-05-07T20:24:00.6046319Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main 2025-05-07T20:24:00.6046779Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 2025-05-07T20:24:00.6047326Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0 2025-05-07T20:24:00.6047888Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0 2025-05-07T20:24:00.6048353Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1 2025-05-07T20:24:00.6048789Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 2025-05-07T20:24:00.6049234Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 2025-05-07T20:24:00.6049694Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 2025-05-07T20:24:00.6050151Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0 2025-05-07T20:24:00.6050644Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0 2025-05-07T20:24:00.6051230Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2 2025-05-07T20:24:00.6051807Z python pkgs/main/linux-64::python-3.9.21-he870216_1 2025-05-07T20:24:00.6052415Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0 2025-05-07T20:24:00.6053072Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py39h06a4308_0 2025-05-07T20:24:00.6053614Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0 2025-05-07T20:24:00.6054007Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0 2025-05-07T20:24:00.6054398Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0 2025-05-07T20:24:00.6054816Z wheel pkgs/main/linux-64::wheel-0.45.1-py39h06a4308_0 2025-05-07T20:24:00.6055201Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1 2025-05-07T20:24:00.6055584Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1 2025-05-07T20:24:00.6055826Z 2025-05-07T20:24:00.6055830Z 2025-05-07T20:24:00.6055842Z 2025-05-07T20:24:00.6055990Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:24:00.6489855Z _openmp_mutex-5.1    | 21 KB     | ########## | 100%
2025-05-07T20:24:00.6619471Z ca-certificates-2025 | 129 KB    | ########## | 100%
2025-05-07T20:24:00.6844985Z _libgcc_mutex-0.1    | 3 KB      | ########## | 100%
2025-05-07T20:24:00.7435016Z setuptools-78.1.1    | 1.7 MB    | ########## | 100%
2025-05-07T20:24:00.8042349Z wheel-0.45.1         | 114 KB    | ########## | 100%
2025-05-07T20:24:01.1683970Z python-3.9.21        | 25.1 MB   | ########## | 100%
2025-05-07T20:24:01.6417089Z done
2025-05-07T20:24:01.8522798Z Preparing transaction: done
2025-05-07T20:24:02.9868051Z Verifying transaction: done
2025-05-07T20:24:05.2056223Z Executing transaction: done
2025-05-07T20:24:05.2561493Z #
2025-05-07T20:24:05.2561882Z # To activate this environment, use
2025-05-07T20:24:05.2562300Z #
2025-05-07T20:24:05.2562536Z #     $ conda activate build_binary
2025-05-07T20:24:05.2562821Z #
2025-05-07T20:24:05.2563051Z # To deactivate an active environment, use
2025-05-07T20:24:05.2563351Z #
2025-05-07T20:24:05.2563540Z #     $ conda deactivate
2025-05-07T20:24:05.3707639Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:05.3729693Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:08.1790268Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (25.1)
2025-05-07T20:24:08.1791202Z Collecting pip
2025-05-07T20:24:08.1791669Z   Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:08.1792645Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:08.1793871Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 113.9 MB/s eta 0:00:00
2025-05-07T20:24:08.1794675Z Installing collected packages: pip
2025-05-07T20:24:08.1795137Z   Attempting uninstall: pip
2025-05-07T20:24:08.1795571Z     Found existing installation: pip 25.1
2025-05-07T20:24:08.1796024Z     Uninstalling pip-25.1:
2025-05-07T20:24:08.1796443Z       Successfully uninstalled pip-25.1
2025-05-07T20:24:08.1796927Z Successfully installed pip-25.1.1
2025-05-07T20:24:08.2428509Z [SETUP] Upgrading pyOpenSSL ...
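Note: pip is upgraded via `conda run -n build_binary`, which executes a command inside the named environment without activating it in the caller's shell — convenient in CI, where each step starts from a fresh shell. The equivalent invocations, assuming the environment created above:

    # Run inside the env without activating it:
    conda run -n build_binary pip install --upgrade pip
    # Roughly equivalent to:
    #   conda activate build_binary && pip install --upgrade pip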
2025-05-07T20:24:08.2451088Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0 2025-05-07T20:24:09.0957627Z Channels: 2025-05-07T20:24:09.0957892Z - conda-forge 2025-05-07T20:24:09.0958136Z Platform: linux-64 2025-05-07T20:24:19.5247043Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:24:21.0403354Z Solving environment: / - \ | / done 2025-05-07T20:24:21.1020975Z 2025-05-07T20:24:21.1021277Z ## Package Plan ## 2025-05-07T20:24:21.1021490Z 2025-05-07T20:24:21.1021792Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:21.1022220Z 2025-05-07T20:24:21.1022336Z added / updated specs: 2025-05-07T20:24:21.1022619Z - pyopenssl[version='>22.1.0'] 2025-05-07T20:24:21.1022826Z 2025-05-07T20:24:21.1022830Z 2025-05-07T20:24:21.1022955Z The following packages will be downloaded: 2025-05-07T20:24:21.1023174Z 2025-05-07T20:24:21.1023302Z package | build 2025-05-07T20:24:21.1023661Z ---------------------------|----------------- 2025-05-07T20:24:21.1024057Z cffi-1.17.1 | py39h15c3d72_0 236 KB conda-forge 2025-05-07T20:24:21.1024700Z cryptography-44.0.3 | py39h7170ec2_0 1.5 MB conda-forge 2025-05-07T20:24:21.1025159Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge 2025-05-07T20:24:21.1025584Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:24:21.1026021Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge 2025-05-07T20:24:21.1026440Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge 2025-05-07T20:24:21.1027005Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge 2025-05-07T20:24:21.1027548Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge 2025-05-07T20:24:21.1027991Z python_abi-3.9 | 2_cp39 4 KB conda-forge 2025-05-07T20:24:21.1028453Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge 2025-05-07T20:24:21.1028942Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge 2025-05-07T20:24:21.1029503Z ------------------------------------------------------------ 2025-05-07T20:24:21.1029872Z Total: 6.3 MB 2025-05-07T20:24:21.1030085Z 2025-05-07T20:24:21.1030215Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:21.1030446Z 2025-05-07T20:24:21.1030641Z cffi conda-forge/linux-64::cffi-1.17.1-py39h15c3d72_0 2025-05-07T20:24:21.1031139Z cryptography conda-forge/linux-64::cryptography-44.0.3-py39h7170ec2_0 2025-05-07T20:24:21.1031637Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2 2025-05-07T20:24:21.1032108Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1 2025-05-07T20:24:21.1032647Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0 2025-05-07T20:24:21.1033120Z python_abi conda-forge/linux-64::python_abi-3.9-2_cp39 2025-05-07T20:24:21.1033639Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0 2025-05-07T20:24:21.1034224Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0 2025-05-07T20:24:21.1034812Z 2025-05-07T20:24:21.1034931Z The following packages will be UPDATED: 2025-05-07T20:24:21.1035148Z 2025-05-07T20:24:21.1035859Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0 2025-05-07T20:24:21.1036643Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2 2025-05-07T20:24:21.1037294Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2 2025-05-07T20:24:21.1037925Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> 
conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:21.1038451Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:21.1815027Z cffi-1.17.1          | 236 KB    | ########## | 100%
2025-05-07T20:24:21.2036176Z libgomp-15.1.0       | 442 KB    | ########## | 100%
2025-05-07T20:24:21.2110102Z pyopenssl-25.0.0     | 120 KB    | ########## | 100%
2025-05-07T20:24:21.2438081Z pycparser-2.22       | 108 KB    | ########## | 100%
2025-05-07T20:24:21.2800856Z typing-extensions-4. | 88 KB     | ########## | 100%
2025-05-07T20:24:21.2864705Z libgcc-15.1.0        | 810 KB    | ########## | 100%
2025-05-07T20:24:21.2884841Z typing_extensions-4. | 51 KB     | ########## | 100%
2025-05-07T20:24:21.3054233Z libgcc-ng-15.1.0     | 34 KB     | ########## | 100%
2025-05-07T20:24:21.3161930Z cryptography-44.0.3  | 1.5 MB    | ########## | 100%
2025-05-07T20:24:21.3689112Z openssl-3.5.0        | 3.0 MB    | ########## | 100%
2025-05-07T20:24:21.3783743Z python_abi-3.9       | 4 KB      | ########## | 100%
2025-05-07T20:24:21.5919825Z done
2025-05-07T20:24:21.6920631Z Preparing transaction: done
2025-05-07T20:24:21.7925542Z Verifying transaction: done
2025-05-07T20:24:23.2951806Z Executing transaction: done
2025-05-07T20:24:23.4752157Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:25.1967327Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:25.1979300Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:25.2002720Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:26.0644748Z Channels:
2025-05-07T20:24:26.0645021Z  - conda-forge
2025-05-07T20:24:26.0645309Z Platform: linux-64
2025-05-07T20:24:29.4120777Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:29.7807400Z Solving environment: done
2025-05-07T20:24:29.8415455Z ## Package Plan ##
2025-05-07T20:24:29.8415842Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:29.8416538Z   added / updated specs:
2025-05-07T20:24:29.8416802Z     - libxcrypt
2025-05-07T20:24:29.8417238Z The following packages will be downloaded:
2025-05-07T20:24:29.8417592Z     package                     |            build
2025-05-07T20:24:29.8417930Z     ----------------------------|-----------------
2025-05-07T20:24:29.8418323Z     libxcrypt-4.4.36            |       hd590300_1          98 KB  conda-forge
2025-05-07T20:24:29.8418737Z     ------------------------------------------------------------
2025-05-07T20:24:29.8419090Z                                            Total:          98 KB
2025-05-07T20:24:29.8419436Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:29.8419880Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:29.8420334Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:30.0010008Z libxcrypt-4.4.36 | 98 KB | | 0% 2025-05-07T20:24:30.0027946Z libxcrypt-4.4.36 | 98 KB | #6 | 16% 2025-05-07T20:24:30.0130109Z libxcrypt-4.4.36 | 98 KB | ########## | 100% 2025-05-07T20:24:30.0132509Z libxcrypt-4.4.36 | 98 KB | ########## | 100% 2025-05-07T20:24:30.0132875Z 2025-05-07T20:24:30.0133164Z done 2025-05-07T20:24:30.1138277Z Preparing transaction: / done 2025-05-07T20:24:30.2143365Z Verifying transaction: \ done 2025-05-07T20:24:30.3149672Z Executing transaction: / done 2025-05-07T20:24:33.7482165Z [SETUP] Copying over ... 2025-05-07T20:24:33.7482892Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.9/crypt.h 2025-05-07T20:24:33.7483466Z 2025-05-07T20:24:33.7513035Z 2025-05-07T20:24:35.3939973Z [SETUP] Installed Python version: Python 3.9.21 2025-05-07T20:24:35.3940757Z [SETUP] Successfully created Conda environment: build_binary 2025-05-07T20:24:35.3972981Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:24:35.3973445Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:24:35.3986307Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:24:35.3986661Z env: 2025-05-07T20:24:35.3986889Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:24:35.3987192Z BUILD_ENV: build_binary 2025-05-07T20:24:35.3987441Z BUILD_TARGET: genai 2025-05-07T20:24:35.3987676Z BUILD_VARIANT: cuda 2025-05-07T20:24:35.3987924Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:24:35.3988219Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:24:35.3988538Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:24:35.3988878Z ##[endgroup] 2025-05-07T20:24:35.7370111Z ################################################################################ 2025-05-07T20:24:35.7370572Z # Install C/C++ Compilers 2025-05-07T20:24:35.7370834Z # 2025-05-07T20:24:35.7384955Z # [2025-05-07T20:24:35.738Z] + install_cxx_compiler build_binary gcc 2025-05-07T20:24:35.7385483Z ################################################################################ 2025-05-07T20:24:35.7385712Z 2025-05-07T20:24:35.7400085Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:24:35.8267846Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:24:35.8278539Z [INSTALL] Installing GLIBC (architecture = 64) ... 
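Note: pinning sysroot_linux-64=2.17 fixes the toolchain's glibc baseline at 2.17 (the manylinux2014 floor), so artifacts built here can load on older distros even though the host glibc is 2.34, as the conda info virtual packages showed. A sketch of checking that no newer glibc symbols leak into a built library; the library name is a placeholder, and the check is illustrative rather than part of setup_env.bash:

    # Highest glibc symbol version referenced by a built artifact
    # (should be <= GLIBC_2.17 for a 2.17 sysroot; library name is hypothetical).
    objdump -T some_built_library.so | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -1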
2025-05-07T20:24:35.8300099Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:36.6892927Z Channels:
2025-05-07T20:24:36.6893260Z  - conda-forge
2025-05-07T20:24:36.6893608Z Platform: linux-64
2025-05-07T20:24:40.0514183Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:40.4225716Z Solving environment: done
2025-05-07T20:24:40.4841202Z ## Package Plan ##
2025-05-07T20:24:40.4841793Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:40.4842404Z   added / updated specs:
2025-05-07T20:24:40.4842705Z     - sysroot_linux-64=2.17
2025-05-07T20:24:40.4843046Z The following packages will be downloaded:
2025-05-07T20:24:40.4843448Z     package                       |            build
2025-05-07T20:24:40.4843805Z     ------------------------------|-----------------
2025-05-07T20:24:40.4844442Z     kernel-headers_linux-64-3.10.0|      he073ed8_18         921 KB  conda-forge
2025-05-07T20:24:40.4845220Z     sysroot_linux-64-2.17         |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:24:40.4845851Z     ------------------------------------------------------------
2025-05-07T20:24:40.4846416Z                                            Total:        15.4 MB
2025-05-07T20:24:40.4846949Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:40.4847617Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:40.4848284Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:40.4848832Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:40.6864621Z kernel-headers_linux | 921 KB    | ########## | 100%
2025-05-07T20:24:40.8902935Z sysroot_linux-64-2.1 | 14.5 MB   | ########## | 100%
2025-05-07T20:24:41.3572111Z done
2025-05-07T20:24:41.4578674Z Preparing transaction: done
2025-05-07T20:24:41.6588864Z Verifying transaction: done
2025-05-07T20:24:41.8634389Z Executing transaction: done
2025-05-07T20:24:42.0224014Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:42.0224473Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:43.7176234Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:24:43.7191078Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
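Note: gxx_linux-64=11.4.0 pulls in conda-forge's prefixed GCC 11.4 toolchain (binaries such as x86_64-conda-linux-gnu-cc) rather than the system compiler, and the package's activation scripts export CC/CXX accordingly. A short sketch of pointing a build at it, assuming the env is activated — the exported variable names follow the conda-forge convention:

    # After activation, the conda-forge compiler packages export CC/CXX.
    conda activate build_binary
    echo "CC=${CC} CXX=${CXX}"   # e.g. .../x86_64-conda-linux-gnu-cc and -c++
    "${CXX}" --version           # expected to report g++ 11.4.0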
2025-05-07T20:24:43.7214991Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0 2025-05-07T20:24:44.6097775Z Channels: 2025-05-07T20:24:44.6098275Z - conda-forge 2025-05-07T20:24:44.6098751Z Platform: linux-64 2025-05-07T20:24:47.9774201Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:24:48.9442993Z Solving environment: \ | / - done 2025-05-07T20:24:49.0081965Z 2025-05-07T20:24:49.0082369Z ## Package Plan ## 2025-05-07T20:24:49.0082623Z 2025-05-07T20:24:49.0082996Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:49.0083886Z 2025-05-07T20:24:49.0084074Z added / updated specs: 2025-05-07T20:24:49.0084424Z - gxx_linux-64=11.4.0 2025-05-07T20:24:49.0084622Z 2025-05-07T20:24:49.0084638Z 2025-05-07T20:24:49.0084776Z The following packages will be downloaded: 2025-05-07T20:24:49.0084998Z 2025-05-07T20:24:49.0085123Z package | build 2025-05-07T20:24:49.0085463Z ---------------------------|----------------- 2025-05-07T20:24:49.0085883Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge 2025-05-07T20:24:49.0086373Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge 2025-05-07T20:24:49.0086858Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge 2025-05-07T20:24:49.0087327Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge 2025-05-07T20:24:49.0087789Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge 2025-05-07T20:24:49.0088242Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge 2025-05-07T20:24:49.0088689Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge 2025-05-07T20:24:49.0089173Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge 2025-05-07T20:24:49.0089662Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge 2025-05-07T20:24:49.0090114Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge 2025-05-07T20:24:49.0090603Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge 2025-05-07T20:24:49.0091094Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge 2025-05-07T20:24:49.0091506Z ------------------------------------------------------------ 2025-05-07T20:24:49.0091862Z Total: 91.6 MB 2025-05-07T20:24:49.0092086Z 2025-05-07T20:24:49.0092219Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:49.0092452Z 2025-05-07T20:24:49.0092730Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7 2025-05-07T20:24:49.0093484Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4 2025-05-07T20:24:49.0094037Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13 2025-05-07T20:24:49.0094549Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4 2025-05-07T20:24:49.0095067Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13 2025-05-07T20:24:49.0095571Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4 2025-05-07T20:24:49.0096109Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:49.0096681Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13 2025-05-07T20:24:49.0097189Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2 2025-05-07T20:24:49.0097737Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:49.0098109Z 2025-05-07T20:24:49.0098227Z The following packages will be UPDATED: 2025-05-07T20:24:49.0098448Z 2025-05-07T20:24:49.0098768Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> 
2025-05-07T20:24:49.0100048Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:51.4215184Z Preparing transaction: done
2025-05-07T20:24:51.7222398Z Verifying transaction: done
2025-05-07T20:24:51.8230974Z Executing transaction: done
2025-05-07T20:24:51.9877936Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:55.8957280Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:55.8986481Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:55.9019697Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:55.9050774Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:57.7900022Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:57.8527692Z [CHECK] Binary cc found in PATH
2025-05-07T20:24:59.7376333Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:59.8031130Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:01.6866573Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:01.7519119Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:03.6405087Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:03.7059335Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:03.7063133Z [INFO] Printing out all preprocessor defines in the C compiler ...
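The dump that follows is produced by the GNU preprocessor: -E stops after preprocessing, -dM prints every macro defined at that point, and "-" reads the translation unit from stdin, so feeding it an empty input yields exactly the compiler's predefined macros. A minimal sketch for reproducing the check locally, assuming the same build_binary conda environment:

```bash
# Dump the C compiler's predefined macros: -E = preprocess only,
# -dM = print macro definitions, "-" = read the (empty) source from stdin.
conda run -n build_binary cc -dM -E - < /dev/null | sort

# Spot-check the toolchain version; with the gxx_linux-64=11.4.0 install
# above this should report "#define __GNUC__ 11".
conda run -n build_binary cc -dM -E - < /dev/null | grep '__GNUC__ '
```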
2025-05-07T20:25:03.7063817Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:03.7064057Z 2025-05-07T20:25:05.6080407Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:05.6081062Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:05.6081700Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:05.6082180Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:05.6082701Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:05.6083259Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:05.6083987Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:05.6084898Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:05.6085282Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:05.6085955Z #define __CHAR_BIT__ 8 2025-05-07T20:25:05.6086646Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:05.6087513Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:05.6087915Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:05.6088844Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:05.6089834Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:05.6090433Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6090819Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:05.6091850Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:05.6092479Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:05.6092986Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:05.6093726Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:05.6094395Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:05.6095262Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:05.6095736Z #define __GCC_IEC_559 2 2025-05-07T20:25:05.6096077Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:05.6096556Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:05.6096888Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:05.6097274Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:05.6098090Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6098477Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:05.6098862Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.6099323Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:05.6099682Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:05.6100017Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:05.6100451Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:05.6100834Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:05.6101282Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:05.6101699Z #define __INT8_C(c) c 2025-05-07T20:25:05.6102055Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:05.6102639Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6103130Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:05.6103567Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:05.6104075Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:05.6104429Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:05.6104809Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6105250Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:05.6105622Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:05.6106117Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:05.6106698Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:05.6107086Z #define __linux 1 2025-05-07T20:25:05.6107414Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:05.6107908Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:05.6108315Z #define __unix 1 2025-05-07T20:25:05.6108602Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:05.6109057Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:05.6109467Z #define __WINT_MIN__ 0U 2025-05-07T20:25:05.6109779Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.6110225Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:05.6110634Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:05.6110966Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:05.6111399Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:05.6111806Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:05.6112202Z #define __INT64_C(c) c ## L 2025-05-07T20:25:05.6112607Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:05.6113025Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:05.6113388Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:05.6113858Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:05.6114343Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:05.6114716Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:05.6115073Z #define __DBL_DIG__ 15 2025-05-07T20:25:05.6115421Z #define __FLT32_DIG__ 6 2025-05-07T20:25:05.6115853Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:05.6116341Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:05.6116766Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:05.6117218Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:05.6117760Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:05.6118075Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:05.6118458Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:05.6118979Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:05.6119462Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:05.6119839Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:05.6120244Z #define __unix__ 1 2025-05-07T20:25:05.6120586Z #define __INT_WIDTH__ 32 2025-05-07T20:25:05.6120892Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:05.6121279Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:05.6121650Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:05.6121986Z #define __UINT16_C(c) c 2025-05-07T20:25:05.6122369Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:05.6122751Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:05.6123184Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:05.6123719Z #define __gnu_linux__ 1 2025-05-07T20:25:05.6124062Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:05.6124424Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.6124861Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6125232Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:05.6125574Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:05.6125992Z #define __GNUC__ 11 2025-05-07T20:25:05.6126306Z #define __pie__ 2 2025-05-07T20:25:05.6126605Z #define __MMX__ 1 2025-05-07T20:25:05.6126993Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:05.6127361Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:05.6127772Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:05.6128193Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:05.6128745Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:05.6129248Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6129744Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:05.6130073Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:05.6130441Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:05.6130914Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:05.6131243Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:05.6131613Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:05.6132093Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:05.6132474Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:05.6132858Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:05.6145006Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:05.6145314Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:05.6145597Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:05.6145899Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:05.6146182Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:05.6146455Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:05.6146802Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:05.6147187Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:05.6147478Z #define __SSE2_MATH__ 1 2025-05-07T20:25:05.6147745Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:05.6148072Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6148376Z #define __amd64 1 2025-05-07T20:25:05.6148622Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:05.6148906Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:05.6149230Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:05.6149549Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:05.6149819Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:05.6150088Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:05.6150350Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:05.6150623Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:05.6150888Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:05.6151170Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:05.6151450Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:05.6151735Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:05.6152310Z #define __x86_64 1 2025-05-07T20:25:05.6152574Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:05.6152953Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:05.6153437Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:05.6153902Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:05.6154377Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:05.6154762Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:05.6155028Z #define __LP64__ 1 2025-05-07T20:25:05.6155274Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6155632Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:05.6156023Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:05.6156319Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:05.6156603Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.6156911Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:05.6157203Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:05.6157494Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:05.6157762Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:05.6158046Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:05.6158322Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:05.6158654Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:05.6159027Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:05.6159312Z #define __FLT_DIG__ 6 2025-05-07T20:25:05.6159547Z #define __NO_INLINE__ 1 2025-05-07T20:25:05.6159800Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:05.6160135Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:05.6160486Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:05.6160918Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:05.6161187Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:05.6161441Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:05.6161707Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:05.6161980Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:05.6162284Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:05.6162567Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:05.6162845Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:05.6163156Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:05.6163484Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:05.6163755Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:05.6164024Z #define __FLT128_DIG__ 33 2025-05-07T20:25:05.6164264Z #define __INT32_C(c) c 2025-05-07T20:25:05.6164520Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:05.6164811Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:05.6165088Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:05.6165379Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:05.6165708Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:05.6166013Z #define unix 1 2025-05-07T20:25:05.6166257Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:05.6166584Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6166897Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:05.6167211Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:05.6167552Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:05.6167816Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:05.6168084Z #define __ELF__ 1 2025-05-07T20:25:05.6168321Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:05.6168611Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:05.6168886Z #define __FLT_RADIX__ 2 2025-05-07T20:25:05.6169145Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:05.6169514Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:05.6169879Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:05.6170143Z #define __SSE_MATH__ 1 2025-05-07T20:25:05.6170381Z #define __k8 1 2025-05-07T20:25:05.6170677Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:05.6171154Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:05.6171460Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:05.6171768Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:05.6172028Z #define __LDBL_DIG__ 18 2025-05-07T20:25:05.6172280Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:05.6172545Z #define __x86_64__ 1 2025-05-07T20:25:05.6172783Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:05.6173089Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:05.6173435Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6173743Z #define __FLT64_DIG__ 15 2025-05-07T20:25:05.6174038Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6174395Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:05.6174710Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6174985Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:05.6175271Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6175573Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:05.6175944Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:05.6176344Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:05.6176641Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:05.6176977Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:05.6177310Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:05.6177664Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:05.6177944Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:05.6178259Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:05.6178544Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:05.6178781Z #define __SEG_FS 1 2025-05-07T20:25:05.6179016Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:05.6179296Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:05.6179572Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6179988Z #define __SEG_GS 1 2025-05-07T20:25:05.6180307Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:05.6180694Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:05.6180971Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:05.6181352Z #define __INT16_TYPE__ short int 2025-05-07T20:25:05.6181637Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:05.6181931Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:05.6182201Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:05.6182454Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:05.6182713Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:05.6183058Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:05.6183449Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6183738Z #define linux 1 2025-05-07T20:25:05.6183970Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6184252Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:05.6184540Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:05.6184789Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:05.6185055Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:05.6185326Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:05.6185671Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:05.6186091Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:05.6186432Z #define __code_model_small__ 1 2025-05-07T20:25:05.6186706Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:05.6187000Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:05.6187256Z #define __k8__ 1 2025-05-07T20:25:05.6187487Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:05.6187785Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:05.6188095Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:05.6188338Z #define __pic__ 2 2025-05-07T20:25:05.6188600Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6188918Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:05.6189225Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6189554Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:05.6190025Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:05.6190389Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:05.6190660Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:05.6190962Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:05.6191279Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:05.6191531Z #define __linux__ 1 2025-05-07T20:25:05.6191764Z #define __INT64_TYPE__ long int 2025-05-07T20:25:05.6192034Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:05.6192297Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:05.6192577Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:05.6192839Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:05.6193134Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6193474Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:05.6193775Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:05.6194054Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:05.6194350Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:05.6194655Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:05.6194993Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:05.6195352Z #define __SSE__ 1 2025-05-07T20:25:05.6195591Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:05.6195937Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:05.6196280Z #define __amd64__ 1 2025-05-07T20:25:05.6196509Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:05.6196768Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:05.6197038Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:05.6197324Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:05.6197639Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:05.6197929Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:05.6198187Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:05.6198562Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:05.6198841Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:05.6199193Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:05.6199672Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:05.6200035Z #define _LP64 1 2025-05-07T20:25:05.6200251Z #define __UINT8_C(c) c 2025-05-07T20:25:05.6200502Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:05.6200777Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:05.6201049Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:05.6201331Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:05.6201640Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:05.6202004Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:05.6202464Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:05.6202841Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6203142Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6203457Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:05.6203831Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:05.6204205Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:05.6204473Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:05.6204815Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:05.6205184Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:05.6205445Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:05.6205701Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:05.6205963Z #define __FXSR__ 1 2025-05-07T20:25:05.6206269Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:05.6206730Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:05.6207148Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:05.6207458Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:05.6207769Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:05.6208118Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:05.6208482Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:05.6208729Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:05.6209072Z #define __PIC__ 2 2025-05-07T20:25:05.6209341Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:05.6209741Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:05.6210139Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:05.6210481Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:05.6210809Z #define __SSE2__ 1 2025-05-07T20:25:05.6211049Z #define __INT32_TYPE__ int 2025-05-07T20:25:05.6211309Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:05.6211569Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:05.6211917Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:05.6212282Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:05.6212564Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:05.6212842Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:05.6213116Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6213391Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:05.6213641Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:05.6213893Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:05.6214187Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6214484Z #define __PIE__ 2 2025-05-07T20:25:05.6214811Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:05.6215204Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:05.6215549Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:05.6215921Z #define __INT16_C(c) c 2025-05-07T20:25:05.6216153Z #define __STDC__ 1 2025-05-07T20:25:05.6216383Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:05.6216666Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:05.6216926Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.6217232Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:05.6217667Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:05.6218003Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:05.6218278Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.6218562Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:05.6218831Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:05.6219122Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:05.6219409Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6219692Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:05.6219997Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6220394Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:05.6220773Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:05.6221177Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:05.6221479Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:05.6221730Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:05.6221898Z 2025-05-07T20:25:05.6750256Z 2025-05-07T20:25:05.6751055Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
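The C++ pass below adds -x c++ so the driver treats the stdin input as C++ rather than C, which pulls in the C++-only macros (__cplusplus, __GNUG__, and the __cpp_* feature-test macros). A quick way to confirm the dialect, again assuming the build_binary environment:

```bash
# -x c++ forces the (empty) stdin input to be compiled as C++, so the
# dump includes C++-only predefined macros such as __cplusplus.
conda run -n build_binary c++ -dM -E -x c++ - < /dev/null \
  | grep -E '__cplusplus|__GNUG__ '
# On this GCC 11.4.0 toolchain the default dialect is gnu++17, so the
# dump shows "#define __cplusplus 201703L" (as in the output below).
```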
2025-05-07T20:25:05.6751524Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:05.6751789Z 2025-05-07T20:25:07.5747491Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:07.5748071Z #define __cpp_attributes 200809L 2025-05-07T20:25:07.5748529Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:07.5749024Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:07.5749422Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:07.5749781Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:07.5750241Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:07.5750604Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:07.5750888Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:07.5751206Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:07.5751525Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:07.5751799Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:07.5752082Z #define __CHAR_BIT__ 8 2025-05-07T20:25:07.5752329Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:07.5752589Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:07.5753201Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:07.5753494Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:07.5753782Z #define __cpp_static_assert 201411L 2025-05-07T20:25:07.5754076Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:07.5754385Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.5754698Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:07.5754990Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:07.5755329Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:07.5755663Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:07.5756070Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:07.5756488Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:07.5756814Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:07.5757106Z #define __GCC_IEC_559 2 2025-05-07T20:25:07.5757356Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:07.5757641Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:07.5757933Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:07.5758265Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:07.5758569Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:07.5758897Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:07.5759207Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:07.5759547Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.5759876Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:07.5760148Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.5760433Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:07.5760721Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:07.5761026Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:07.5761291Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:07.5761559Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:07.5762008Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:07.5762341Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:07.5762688Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:07.5762952Z #define __INT8_C(c) c 2025-05-07T20:25:07.5763190Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:07.5763473Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:07.5763802Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.5764126Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:07.5764411Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:07.5764709Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:07.5765026Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.5765389Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:07.5765680Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:07.5765965Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.5766236Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.5766526Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:07.5766812Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:07.5767213Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:07.5767633Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:07.5767931Z #define __linux 1 2025-05-07T20:25:07.5768163Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:07.5768453Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:07.5768739Z #define __unix 1 2025-05-07T20:25:07.5768967Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:07.5769261Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:07.5769555Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:07.5769835Z #define __WINT_MIN__ 0U 2025-05-07T20:25:07.5770085Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.5770381Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:07.5770663Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:07.5770931Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:07.5771195Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:07.5771485Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:07.5771785Z #define __INT64_C(c) c ## L 2025-05-07T20:25:07.5772156Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:07.5772465Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:07.5772743Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:07.5773053Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:07.5773341Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:07.5773605Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:07.5773963Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:07.5774346Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:07.5774608Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:07.5774889Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:07.5775171Z #define __DBL_DIG__ 15 2025-05-07T20:25:07.5775409Z #define __FLT32_DIG__ 6 2025-05-07T20:25:07.5775711Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:07.5776068Z #define __GXX_WEAK__ 1 2025-05-07T20:25:07.5776310Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:07.5776557Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:07.5776900Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:07.5777255Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.5777521Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:07.5777828Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:07.5778161Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:07.5778572Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:07.5778968Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:07.5779248Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:07.5779517Z #define __unix__ 1 2025-05-07T20:25:07.5779741Z #define __INT_WIDTH__ 32 2025-05-07T20:25:07.5779990Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:07.5780242Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:07.5780582Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:07.5780855Z #define __UINT16_C(c) c 2025-05-07T20:25:07.5781250Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:07.5781555Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:07.5781922Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:07.5782293Z #define __gnu_linux__ 1 2025-05-07T20:25:07.5782533Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:07.5782801Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:07.5783089Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.5793030Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.5793330Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:07.5793615Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:07.5793885Z #define __GNUC__ 11 2025-05-07T20:25:07.5794110Z #define __GXX_RTTI 1 2025-05-07T20:25:07.5794355Z #define __pie__ 2 2025-05-07T20:25:07.5794587Z #define __MMX__ 1 2025-05-07T20:25:07.5794816Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:07.5795101Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:07.5795408Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:07.5795694Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:07.5795957Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:07.5796287Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:07.5796628Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:07.5796985Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.5797369Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:07.5797682Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.5798009Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.5798283Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:07.5798555Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:07.5798872Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:07.5799177Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:07.5799455Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:07.5799717Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:07.5800014Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:07.5800316Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:07.5800587Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:07.5801061Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:07.5801327Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:07.5801592Z #define __cplusplus 201703L 2025-05-07T20:25:07.5801868Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:07.5802159Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:07.5802417Z #define __DEPRECATED 1 2025-05-07T20:25:07.5802679Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:07.5802980Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:07.5803237Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:07.5803561Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.5803927Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:07.5804204Z #define __SSE2_MATH__ 1 2025-05-07T20:25:07.5804452Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:07.5804769Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.5805069Z #define __amd64 1 2025-05-07T20:25:07.5805294Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:07.5805572Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:07.5805844Z #define __GNUG__ 11 2025-05-07T20:25:07.5806102Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:07.5806419Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:07.5806683Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:07.5806941Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:07.5807225Z [... several hundred `c++ -dM -E` predefined-macro lines elided: GCC 11.4.0 (__VERSION__ "11.4.0", GXX ABI 1016) targeting x86_64 GNU/Linux (__x86_64__, __linux__, __ELF__, _LP64), little-endian, with C++17-era feature-test macros ...]
2025-05-07T20:25:07.6414018Z + conda run -n build_binary c++ --version
2025-05-07T20:25:09.5253755Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:09.5254153Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:09.5254615Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:09.5255187Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:09.5884422Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:09.5885509Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:11.5495418Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:11.5498121Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:11.5498700Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:13.5105153Z #define __cplusplus 201703L
2025-05-07T20:25:13.5109905Z [INSTALL] Successfully installed C/C++ compilers
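The two probes above generalize to any toolchain. A minimal sketch replaying them, using the same commands the log shows (the build_binary env name comes from this job; 201710L and 201703L correspond to C17 and C++17):

```bash
# Sketch: replay the compiler-standard probes from the log above.
# Assumes a conda env named "build_binary" (as in this job) with cc/c++ installed.
set -euo pipefail

# Default C standard; 201710L means C17.
conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__

# Default C++ standard; -x c++ forces C++ mode on stdin, 201703L means C++17.
conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
```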
2025-05-07T20:25:13.5156071Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:13.5156505Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:13.5168362Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:13.5168710Z env:
2025-05-07T20:25:13.5168932Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:13.5169241Z BUILD_ENV: build_binary
2025-05-07T20:25:13.5169490Z BUILD_TARGET: genai
2025-05-07T20:25:13.5169717Z BUILD_VARIANT: cuda
2025-05-07T20:25:13.5169955Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:13.5170214Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:13.5170512Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:13.5170846Z ##[endgroup]
2025-05-07T20:25:13.8543353Z ################################################################################
2025-05-07T20:25:13.8543720Z # Install CUDA
2025-05-07T20:25:13.8543938Z #
2025-05-07T20:25:13.8559563Z # [2025-05-07T20:25:13.855Z] + install_cuda build_binary 12.6.3
2025-05-07T20:25:13.8559967Z ################################################################################
2025-05-07T20:25:13.8576330Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:13.9483334Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:13.9483780Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:13.9489108Z + conda clean --packages --tarball -y
2025-05-07T20:25:14.6575106Z Will remove 32 (140.4 MB) tarball(s).
2025-05-07T20:25:14.6575581Z Will remove 6 (617 KB) package(s).
2025-05-07T20:25:14.7213741Z + conda clean --all -y
2025-05-07T20:25:15.3890276Z There are no unused tarball(s) to remove.
2025-05-07T20:25:15.3890763Z Will remove 1 index cache(s).
2025-05-07T20:25:15.3891217Z There are no unused package(s) to remove.
2025-05-07T20:25:15.3891680Z There are no tempfile(s) to remove.
2025-05-07T20:25:15.3892119Z There are no logfile(s) to remove.
2025-05-07T20:25:15.4552510Z [INSTALL] Installing CUDA 12.6.3 ...
2025-05-07T20:25:15.4577857Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3
2025-05-07T20:25:16.3648231Z Channels:
2025-05-07T20:25:16.3648492Z - conda-forge
2025-05-07T20:25:16.3648735Z Platform: linux-64
2025-05-07T20:25:26.8556977Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:27.9561843Z Solving environment: done
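The [EXEC] [ATTEMPT 0/3] prefix on both the wget probe and the conda install suggests setup_env.bash wraps flaky commands in a retry loop. A hypothetical sketch of that pattern (the real logic lives in .github/scripts/setup_env.bash and may differ; the function name exec_with_retries and the 10s backoff are assumptions):

```bash
# Hypothetical sketch of the retry pattern behind the "[EXEC] [ATTEMPT n/3]"
# log lines; not the actual implementation from setup_env.bash.
exec_with_retries () {
  local max_attempts=3 attempt
  for attempt in $(seq 0 $((max_attempts - 1))); do
    echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
    if "$@"; then
      return 0            # command succeeded; stop retrying
    fi
    sleep 10              # brief backoff before the next attempt
  done
  echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
  return 1
}

# Usage matching the install step above:
exec_with_retries conda install --force-reinstall -n build_binary \
  -c conda-forge --override-channels -y cuda=12.6.3
```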
2025-05-07T20:25:28.0291887Z ## Package Plan ##
2025-05-07T20:25:28.0292393Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:28.0293005Z added / updated specs:
2025-05-07T20:25:28.0293354Z - cuda=12.6.3
2025-05-07T20:25:28.0293777Z The following packages will be downloaded:
2025-05-07T20:25:28.0294420Z package | build
2025-05-07T20:25:28.0294888Z ---------------------------|-----------------
2025-05-07T20:25:28.0298837Z cuda-12.6.3 | ha804496_0 26 KB conda-forge
2025-05-07T20:25:28.0308997Z cuda-nsight-12.6.77 | h7938cbb_0 113.2 MB conda-forge
2025-05-07T20:25:28.0310866Z cuda-nvcc-tools-12.6.85 | he02047a_0 23.0 MB conda-forge
2025-05-07T20:25:28.0311825Z cuda-nvdisasm-12.6.77 | hbd13f7d_1 47.6 MB conda-forge
2025-05-07T20:25:28.0316450Z cuda-nvvp-12.6.80 | hbd13f7d_1 109.3 MB conda-forge
2025-05-07T20:25:28.0326892Z gds-tools-1.11.1.6 | h5888daf_4 37.8 MB conda-forge
2025-05-07T20:25:28.0329319Z libcublas-12.6.4.1 | h5888daf_1 256.2 MB conda-forge
2025-05-07T20:25:28.0330226Z libcufft-11.3.0.4 | hbd13f7d_0 156.2 MB conda-forge
2025-05-07T20:25:28.0332047Z libcurand-10.3.7.77 | hbd13f7d_0 39.9 MB conda-forge
2025-05-07T20:25:28.0332992Z libcusolver-11.7.1.2 | h5888daf_1 95.8 MB conda-forge
2025-05-07T20:25:28.0333934Z libcusparse-12.5.4.2 | hbd13f7d_0 118.6 MB conda-forge
2025-05-07T20:25:28.0338959Z libnpp-12.3.1.54 | h5888daf_0 93.4 MB conda-forge
2025-05-07T20:25:28.0348655Z nsight-compute-2024.3.2.3 | hb5ebaad_0 443.1 MB conda-forge
2025-05-07T20:25:28.0351691Z python-3.9.18 |h0755675_1_cpython 22.7 MB conda-forge
[... roughly 100 smaller conda-forge packages (remaining cuda-* components, dev packages, X11/xorg libraries, fonts, compression and support libraries) elided ...]
2025-05-07T20:25:28.0364026Z ------------------------------------------------------------
2025-05-07T20:25:28.0364380Z Total: 1.63 GB
2025-05-07T20:25:28.0364728Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:28.0367355Z cuda conda-forge/noarch::cuda-12.6.3-ha804496_0
2025-05-07T20:25:28.0393262Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0
2025-05-07T20:25:28.0394333Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3
2025-05-07T20:25:28.0401060Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13
2025-05-07T20:25:28.0402315Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13
[... remaining NEW packages (the full CUDA 12.6 component set, cuBLAS/cuFFT/cuRAND/cuSOLVER/cuSPARSE/NPP libraries, and their X11/xorg, font, and support dependencies) elided ...]
2025-05-07T20:25:28.0451034Z The following packages will be UPDATED:
2025-05-07T20:25:28.0451480Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:28.0452041Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:28.0452646Z python pkgs/main::python-3.9.21-he870216_1 --> conda-forge::python-3.9.18-h0755675_1_cpython
2025-05-07T20:25:28.0453273Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:25:28.0453857Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
2025-05-07T20:25:28.0454521Z Downloading and Extracting Packages: ...working...
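The UPDATED/SUPERSEDED entries above follow from solving against conda-forge only (-c conda-forge --override-channels): packages that originally came from pkgs/main (python, sqlite, tk) get rebuilt from conda-forge, even at a lower version. A hedged sketch of recreating this environment from conda-forge from the start, which avoids the supersedence churn (env name and versions mirror this job; the nvcc check is an assumed verification step not shown in this log):

```bash
# Sketch: build the env from conda-forge only, so python/sqlite/tk never come
# from pkgs/main and nothing needs to be superseded later.
conda create -y -n build_binary -c conda-forge --override-channels python=3.9
conda install -y -n build_binary -c conda-forge --override-channels cuda=12.6.3

# cuda-nvcc is part of the cuda metapackage, so nvcc should resolve in the env
# (assumed verification step, not shown in this log).
conda run -n build_binary nvcc --version
```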
[... interleaved, animated download progress bars elided: nsight-compute (443.1 MB), libcublas (256.2 MB), libcufft (156.2 MB), libcusparse (118.6 MB), cuda-nsight (113.2 MB), cuda-nvvp (109.3 MB), libcusolver (95.8 MB), libnpp (93.4 MB), and the remaining packages; the captured log ends mid-download ...]
| 443.1 MB | ##1 | 21% 2025-05-07T20:25:30.6290827Z 2025-05-07T20:25:30.6695500Z libcublas-12.6.4.1 | 256.2 MB | ###2 | 33%  2025-05-07T20:25:30.6695798Z 2025-05-07T20:25:30.6695802Z 2025-05-07T20:25:30.6695806Z 2025-05-07T20:25:30.6696147Z 2025-05-07T20:25:30.6822130Z cuda-nsight-12.6.77 | 113.2 MB | #######5 | 76%  2025-05-07T20:25:30.6822428Z 2025-05-07T20:25:30.6822692Z 2025-05-07T20:25:30.6822917Z 2025-05-07T20:25:30.7184937Z libcusparse-12.5.4.2 | 118.6 MB | #######1 | 71%  2025-05-07T20:25:30.7185242Z 2025-05-07T20:25:30.7185247Z 2025-05-07T20:25:30.7283338Z libcufft-11.3.0.4 | 156.2 MB | #####5 | 55%  2025-05-07T20:25:30.7293395Z nsight-compute-2024. | 443.1 MB | ##2 | 22% 2025-05-07T20:25:30.7294310Z 2025-05-07T20:25:30.7697608Z libcublas-12.6.4.1 | 256.2 MB | ###4 | 34%  2025-05-07T20:25:30.7697887Z 2025-05-07T20:25:30.7697893Z 2025-05-07T20:25:30.7697899Z 2025-05-07T20:25:30.7700098Z 2025-05-07T20:25:30.7887442Z cuda-nsight-12.6.77 | 113.2 MB | #######8 | 79%  2025-05-07T20:25:30.7888097Z 2025-05-07T20:25:30.7888103Z 2025-05-07T20:25:30.7888511Z 2025-05-07T20:25:30.8190751Z libcusparse-12.5.4.2 | 118.6 MB | #######4 | 74%  2025-05-07T20:25:30.8191068Z 2025-05-07T20:25:30.8191072Z 2025-05-07T20:25:30.8285242Z libcufft-11.3.0.4 | 156.2 MB | #####7 | 57%  2025-05-07T20:25:30.8293983Z nsight-compute-2024. | 443.1 MB | ##3 | 23% 2025-05-07T20:25:30.8296687Z 2025-05-07T20:25:30.8697803Z libcublas-12.6.4.1 | 256.2 MB | ###5 | 36%  2025-05-07T20:25:30.8698220Z 2025-05-07T20:25:30.8698226Z 2025-05-07T20:25:30.8698232Z 2025-05-07T20:25:30.8699093Z 2025-05-07T20:25:30.8889935Z cuda-nsight-12.6.77 | 113.2 MB | ########2 | 83%  2025-05-07T20:25:30.8890276Z 2025-05-07T20:25:30.8890280Z 2025-05-07T20:25:30.8892753Z 2025-05-07T20:25:30.9192781Z libcusparse-12.5.4.2 | 118.6 MB | #######7 | 77%  2025-05-07T20:25:30.9193077Z 2025-05-07T20:25:30.9193688Z 2025-05-07T20:25:30.9334638Z libcufft-11.3.0.4 | 156.2 MB | #####9 | 59%  2025-05-07T20:25:30.9350983Z nsight-compute-2024. | 443.1 MB | ##3 | 24% 2025-05-07T20:25:30.9351404Z 2025-05-07T20:25:30.9698922Z libcublas-12.6.4.1 | 256.2 MB | ###7 | 37%  2025-05-07T20:25:30.9699343Z 2025-05-07T20:25:30.9699349Z 2025-05-07T20:25:30.9699382Z 2025-05-07T20:25:30.9701561Z 2025-05-07T20:25:30.9891612Z cuda-nsight-12.6.77 | 113.2 MB | ########6 | 86%  2025-05-07T20:25:30.9891981Z 2025-05-07T20:25:30.9892536Z 2025-05-07T20:25:30.9894033Z 2025-05-07T20:25:31.0193316Z libcusparse-12.5.4.2 | 118.6 MB | ######## | 81%  2025-05-07T20:25:31.0193623Z 2025-05-07T20:25:31.0194227Z 2025-05-07T20:25:31.0338049Z libcufft-11.3.0.4 | 156.2 MB | ######1 | 62%  2025-05-07T20:25:31.0700766Z nsight-compute-2024. | 443.1 MB | ##4 | 25% 2025-05-07T20:25:31.0701041Z 2025-05-07T20:25:31.0701045Z 2025-05-07T20:25:31.0701050Z 2025-05-07T20:25:31.0701766Z 2025-05-07T20:25:31.0749194Z cuda-nsight-12.6.77 | 113.2 MB | ########9 | 90%  2025-05-07T20:25:31.0750231Z 2025-05-07T20:25:31.0894078Z libcublas-12.6.4.1 | 256.2 MB | ###8 | 39%  2025-05-07T20:25:31.0894415Z 2025-05-07T20:25:31.0894421Z 2025-05-07T20:25:31.0896311Z 2025-05-07T20:25:31.1198850Z libcusparse-12.5.4.2 | 118.6 MB | ########4 | 84%  2025-05-07T20:25:31.1199229Z 2025-05-07T20:25:31.1199236Z 2025-05-07T20:25:31.1339472Z libcufft-11.3.0.4 | 156.2 MB | ######4 | 64%  2025-05-07T20:25:31.1733837Z nsight-compute-2024. 
| 443.1 MB | ##5 | 26% 2025-05-07T20:25:31.1734343Z 2025-05-07T20:25:31.1734349Z 2025-05-07T20:25:31.1734354Z 2025-05-07T20:25:31.1735081Z 2025-05-07T20:25:31.1750837Z cuda-nsight-12.6.77 | 113.2 MB | #########3 | 93%  2025-05-07T20:25:31.1751130Z 2025-05-07T20:25:31.1920123Z libcublas-12.6.4.1 | 256.2 MB | #### | 40%  2025-05-07T20:25:31.1920418Z 2025-05-07T20:25:31.1920424Z 2025-05-07T20:25:31.1922824Z 2025-05-07T20:25:31.2239315Z libcusparse-12.5.4.2 | 118.6 MB | ########7 | 87%  2025-05-07T20:25:31.2239704Z 2025-05-07T20:25:31.2241332Z 2025-05-07T20:25:31.2376409Z libcufft-11.3.0.4 | 156.2 MB | ######6 | 66%  2025-05-07T20:25:31.2735198Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:25:31.2735995Z 2025-05-07T20:25:31.2736004Z 2025-05-07T20:25:31.2736010Z 2025-05-07T20:25:31.2736974Z 2025-05-07T20:25:31.2751701Z cuda-nsight-12.6.77 | 113.2 MB | #########6 | 97%  2025-05-07T20:25:31.2752696Z 2025-05-07T20:25:31.2952597Z libcublas-12.6.4.1 | 256.2 MB | ####1 | 42%  2025-05-07T20:25:31.2952963Z 2025-05-07T20:25:31.2952967Z 2025-05-07T20:25:31.2953639Z 2025-05-07T20:25:31.3283904Z libcusparse-12.5.4.2 | 118.6 MB | ######### | 90%  2025-05-07T20:25:31.3284339Z 2025-05-07T20:25:31.3284345Z 2025-05-07T20:25:31.3455102Z libcufft-11.3.0.4 | 156.2 MB | ######8 | 69%  2025-05-07T20:25:31.3752014Z nsight-compute-2024. | 443.1 MB | ##7 | 28% 2025-05-07T20:25:31.3752875Z 2025-05-07T20:25:31.3953072Z libcublas-12.6.4.1 | 256.2 MB | ####3 | 44%  2025-05-07T20:25:31.3953359Z 2025-05-07T20:25:31.3953366Z 2025-05-07T20:25:31.3955329Z 2025-05-07T20:25:31.4457872Z libcusparse-12.5.4.2 | 118.6 MB | #########4 | 95%  2025-05-07T20:25:31.4755450Z nsight-compute-2024. | 443.1 MB | ##9 | 29% 2025-05-07T20:25:31.4758268Z 2025-05-07T20:25:31.4956506Z libcublas-12.6.4.1 | 256.2 MB | ####5 | 46%  2025-05-07T20:25:31.4956841Z 2025-05-07T20:25:31.4956847Z 2025-05-07T20:25:31.4956852Z 2025-05-07T20:25:31.5067273Z libcusparse-12.5.4.2 | 118.6 MB | #########8 | 99%  2025-05-07T20:25:31.5067623Z 2025-05-07T20:25:31.5067629Z 2025-05-07T20:25:31.5521599Z libcufft-11.3.0.4 | 156.2 MB | ####### | 71%  2025-05-07T20:25:31.5867449Z nsight-compute-2024. | 443.1 MB | ### | 30% 2025-05-07T20:25:31.5868848Z 2025-05-07T20:25:31.6071370Z libcublas-12.6.4.1 | 256.2 MB | ####7 | 47%  2025-05-07T20:25:31.6071693Z 2025-05-07T20:25:31.6071698Z 2025-05-07T20:25:31.6548938Z libcufft-11.3.0.4 | 156.2 MB | #######3 | 73%  2025-05-07T20:25:31.6867624Z nsight-compute-2024. | 443.1 MB | ###1 | 31% 2025-05-07T20:25:31.6868836Z 2025-05-07T20:25:31.7074400Z libcublas-12.6.4.1 | 256.2 MB | ####9 | 49%  2025-05-07T20:25:31.7074952Z 2025-05-07T20:25:31.7074957Z 2025-05-07T20:25:31.7550363Z libcufft-11.3.0.4 | 156.2 MB | #######5 | 76%  2025-05-07T20:25:31.7891412Z nsight-compute-2024. | 443.1 MB | ###2 | 32% 2025-05-07T20:25:31.7891824Z 2025-05-07T20:25:31.8126436Z libcublas-12.6.4.1 | 256.2 MB | #####1 | 51%  2025-05-07T20:25:31.8126779Z 2025-05-07T20:25:31.8129601Z 2025-05-07T20:25:31.8551579Z libcufft-11.3.0.4 | 156.2 MB | #######8 | 78%  2025-05-07T20:25:31.9137995Z nsight-compute-2024. | 443.1 MB | ###3 | 33% 2025-05-07T20:25:31.9138269Z 2025-05-07T20:25:31.9138273Z 2025-05-07T20:25:31.9435246Z libcufft-11.3.0.4 | 156.2 MB | ######## | 81%  2025-05-07T20:25:31.9436564Z 2025-05-07T20:25:31.9552439Z libcublas-12.6.4.1 | 256.2 MB | #####2 | 53%  2025-05-07T20:25:32.0437365Z nsight-compute-2024. 
| 443.1 MB | ###4 | 34% 2025-05-07T20:25:32.0437832Z 2025-05-07T20:25:32.0488444Z libcublas-12.6.4.1 | 256.2 MB | #####4 | 55%  2025-05-07T20:25:32.0488769Z 2025-05-07T20:25:32.0488773Z 2025-05-07T20:25:32.0561371Z libcufft-11.3.0.4 | 156.2 MB | ########2 | 83%  2025-05-07T20:25:32.1438884Z nsight-compute-2024. | 443.1 MB | ###5 | 36% 2025-05-07T20:25:32.1442294Z 2025-05-07T20:25:32.1563432Z libcublas-12.6.4.1 | 256.2 MB | #####6 | 56%  2025-05-07T20:25:32.1721599Z nsight-compute-2024. | 443.1 MB | ###6 | 37% 2025-05-07T20:25:32.1721877Z 2025-05-07T20:25:32.1723953Z 2025-05-07T20:25:32.2440511Z libcufft-11.3.0.4 | 156.2 MB | ########5 | 85%  2025-05-07T20:25:32.2441582Z 2025-05-07T20:25:32.2567323Z libcublas-12.6.4.1 | 256.2 MB | #####8 | 58%  2025-05-07T20:25:32.3218550Z nsight-compute-2024. | 443.1 MB | ###7 | 38% 2025-05-07T20:25:32.3218880Z 2025-05-07T20:25:32.3219604Z 2025-05-07T20:25:32.3444284Z libcufft-11.3.0.4 | 156.2 MB | ########7 | 87%  2025-05-07T20:25:32.3445639Z 2025-05-07T20:25:32.3569428Z libcublas-12.6.4.1 | 256.2 MB | ###### | 60%  2025-05-07T20:25:32.4220956Z nsight-compute-2024. | 443.1 MB | ###9 | 39% 2025-05-07T20:25:32.4221294Z 2025-05-07T20:25:32.4222745Z 2025-05-07T20:25:32.4503068Z libcufft-11.3.0.4 | 156.2 MB | ########9 | 89%  2025-05-07T20:25:32.4504858Z 2025-05-07T20:25:32.4799735Z libcublas-12.6.4.1 | 256.2 MB | ######1 | 62%  2025-05-07T20:25:32.5220988Z nsight-compute-2024. | 443.1 MB | #### | 40% 2025-05-07T20:25:32.5221361Z 2025-05-07T20:25:32.5222665Z 2025-05-07T20:25:32.5562117Z libcufft-11.3.0.4 | 156.2 MB | #########2 | 92%  2025-05-07T20:25:32.5564688Z 2025-05-07T20:25:32.5811609Z libcublas-12.6.4.1 | 256.2 MB | ######3 | 64%  2025-05-07T20:25:32.6564392Z nsight-compute-2024. | 443.1 MB | ####1 | 41% 2025-05-07T20:25:32.6565086Z 2025-05-07T20:25:32.6645203Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 65%  2025-05-07T20:25:32.6645488Z 2025-05-07T20:25:32.6649977Z 2025-05-07T20:25:32.6813155Z libcufft-11.3.0.4 | 156.2 MB | #########4 | 94%  2025-05-07T20:25:32.7587727Z nsight-compute-2024. | 443.1 MB | ####2 | 43% 2025-05-07T20:25:32.7588785Z 2025-05-07T20:25:32.7645646Z libcublas-12.6.4.1 | 256.2 MB | ######7 | 67%  2025-05-07T20:25:32.7645955Z 2025-05-07T20:25:32.7645961Z 2025-05-07T20:25:32.8105710Z libcufft-11.3.0.4 | 156.2 MB | #########6 | 96%  2025-05-07T20:25:32.8629797Z nsight-compute-2024. | 443.1 MB | ####3 | 44% 2025-05-07T20:25:32.8630148Z 2025-05-07T20:25:32.8646817Z libcublas-12.6.4.1 | 256.2 MB | ######8 | 69%  2025-05-07T20:25:32.8647124Z 2025-05-07T20:25:32.8647130Z 2025-05-07T20:25:32.9212754Z libcufft-11.3.0.4 | 156.2 MB | #########8 | 98%  2025-05-07T20:25:32.9631900Z nsight-compute-2024. | 443.1 MB | ####4 | 45% 2025-05-07T20:25:32.9632267Z 2025-05-07T20:25:33.0221264Z libcublas-12.6.4.1 | 256.2 MB | ####### | 71%  2025-05-07T20:25:33.0696811Z nsight-compute-2024. | 443.1 MB | ####5 | 46% 2025-05-07T20:25:33.0697976Z 2025-05-07T20:25:33.1237006Z libcublas-12.6.4.1 | 256.2 MB | #######2 | 73%  2025-05-07T20:25:33.1709768Z nsight-compute-2024. | 443.1 MB | ####6 | 47% 2025-05-07T20:25:33.1711377Z 2025-05-07T20:25:33.2257739Z libcublas-12.6.4.1 | 256.2 MB | #######4 | 74%  2025-05-07T20:25:33.2731522Z nsight-compute-2024. | 443.1 MB | ####7 | 48% 2025-05-07T20:25:33.2732984Z 2025-05-07T20:25:33.3258790Z libcublas-12.6.4.1 | 256.2 MB | #######6 | 76%  2025-05-07T20:25:33.3762272Z nsight-compute-2024. 
| 443.1 MB | ####9 | 49% 2025-05-07T20:25:33.3762647Z 2025-05-07T20:25:33.4191089Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 78%  2025-05-07T20:25:33.4191469Z 2025-05-07T20:25:33.4191476Z 2025-05-07T20:25:33.4191482Z 2025-05-07T20:25:33.4191488Z 2025-05-07T20:25:33.4265055Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:33.4832029Z nsight-compute-2024. | 443.1 MB | ##### | 50% 2025-05-07T20:25:33.4834256Z 2025-05-07T20:25:33.5020803Z libcublas-12.6.4.1 | 256.2 MB | #######9 | 80%  2025-05-07T20:25:33.5021287Z 2025-05-07T20:25:33.5021304Z 2025-05-07T20:25:33.5021309Z 2025-05-07T20:25:33.5021314Z 2025-05-07T20:25:33.5021319Z 2025-05-07T20:25:33.5370991Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:25:33.6021526Z nsight-compute-2024. | 443.1 MB | #####1 | 51% 2025-05-07T20:25:33.6021883Z 2025-05-07T20:25:33.6022418Z 2025-05-07T20:25:33.6022428Z 2025-05-07T20:25:33.6022435Z 2025-05-07T20:25:33.6022622Z 2025-05-07T20:25:33.6056106Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 3%  2025-05-07T20:25:33.6056476Z 2025-05-07T20:25:33.6056482Z 2025-05-07T20:25:33.6056492Z 2025-05-07T20:25:33.6340819Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:33.6347241Z 2025-05-07T20:25:33.6615473Z libcublas-12.6.4.1 | 256.2 MB | ########1 | 81%  2025-05-07T20:25:33.6615839Z 2025-05-07T20:25:33.6615851Z 2025-05-07T20:25:33.6615857Z 2025-05-07T20:25:33.6615863Z 2025-05-07T20:25:33.6615868Z 2025-05-07T20:25:33.6617159Z 2025-05-07T20:25:33.6650730Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:25:33.7023862Z nsight-compute-2024. | 443.1 MB | #####2 | 52% 2025-05-07T20:25:33.7024226Z 2025-05-07T20:25:33.7024232Z 2025-05-07T20:25:33.7024238Z 2025-05-07T20:25:33.7024243Z 2025-05-07T20:25:33.7025642Z 2025-05-07T20:25:33.7618000Z cuda-nvvp-12.6.80 | 109.3 MB | 5 | 5%  2025-05-07T20:25:33.7618344Z 2025-05-07T20:25:33.7618348Z 2025-05-07T20:25:33.7618352Z 2025-05-07T20:25:33.7618357Z 2025-05-07T20:25:33.7618360Z 2025-05-07T20:25:33.7620199Z 2025-05-07T20:25:33.7905870Z libcusolver-11.7.1.2 | 95.8 MB | 2 | 3%  2025-05-07T20:25:33.7983878Z nsight-compute-2024. | 443.1 MB | #####3 | 53% 2025-05-07T20:25:33.7989492Z 2025-05-07T20:25:33.8026783Z libcublas-12.6.4.1 | 256.2 MB | ########2 | 83%  2025-05-07T20:25:33.8027137Z 2025-05-07T20:25:33.8027143Z 2025-05-07T20:25:33.8027149Z 2025-05-07T20:25:33.8027154Z 2025-05-07T20:25:33.8027160Z 2025-05-07T20:25:33.8622425Z cuda-nvvp-12.6.80 | 109.3 MB | 7 | 8%  2025-05-07T20:25:33.8622794Z 2025-05-07T20:25:33.8622804Z 2025-05-07T20:25:33.8622808Z 2025-05-07T20:25:33.8622812Z 2025-05-07T20:25:33.8622815Z 2025-05-07T20:25:33.8623997Z 2025-05-07T20:25:33.9034775Z libcusolver-11.7.1.2 | 95.8 MB | 5 | 5%  2025-05-07T20:25:33.9035145Z 2025-05-07T20:25:33.9035149Z 2025-05-07T20:25:33.9035153Z 2025-05-07T20:25:33.9035157Z 2025-05-07T20:25:33.9038235Z 2025-05-07T20:25:33.9196832Z cuda-nvvp-12.6.80 | 109.3 MB | # | 10%  2025-05-07T20:25:33.9328790Z nsight-compute-2024. 
| 443.1 MB | #####3 | 54% 2025-05-07T20:25:33.9329220Z 2025-05-07T20:25:33.9630122Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 84%  2025-05-07T20:25:33.9630730Z 2025-05-07T20:25:33.9630735Z 2025-05-07T20:25:33.9630739Z 2025-05-07T20:25:33.9630742Z 2025-05-07T20:25:33.9630746Z 2025-05-07T20:25:33.9630750Z 2025-05-07T20:25:34.0037843Z libcusolver-11.7.1.2 | 95.8 MB | 8 | 8%  2025-05-07T20:25:34.0038154Z 2025-05-07T20:25:34.0038158Z 2025-05-07T20:25:34.0038162Z 2025-05-07T20:25:34.0038165Z 2025-05-07T20:25:34.0046125Z 2025-05-07T20:25:34.0437647Z cuda-nvvp-12.6.80 | 109.3 MB | #2 | 13%  2025-05-07T20:25:34.0589745Z nsight-compute-2024. | 443.1 MB | #####4 | 55% 2025-05-07T20:25:34.0592004Z 2025-05-07T20:25:34.0638531Z libcublas-12.6.4.1 | 256.2 MB | ########5 | 86%  2025-05-07T20:25:34.0638796Z 2025-05-07T20:25:34.0638800Z 2025-05-07T20:25:34.0638817Z 2025-05-07T20:25:34.0638820Z 2025-05-07T20:25:34.0638824Z 2025-05-07T20:25:34.0641667Z 2025-05-07T20:25:34.1094012Z libcusolver-11.7.1.2 | 95.8 MB | # | 11%  2025-05-07T20:25:34.1094372Z 2025-05-07T20:25:34.1094385Z 2025-05-07T20:25:34.1094389Z 2025-05-07T20:25:34.1094393Z 2025-05-07T20:25:34.1096867Z 2025-05-07T20:25:34.1487739Z cuda-nvvp-12.6.80 | 109.3 MB | #4 | 15%  2025-05-07T20:25:34.1646106Z nsight-compute-2024. | 443.1 MB | #####5 | 56% 2025-05-07T20:25:34.1646481Z 2025-05-07T20:25:34.1646486Z 2025-05-07T20:25:34.1646492Z 2025-05-07T20:25:34.1646497Z 2025-05-07T20:25:34.1646502Z 2025-05-07T20:25:34.1646508Z 2025-05-07T20:25:34.1671243Z libcusolver-11.7.1.2 | 95.8 MB | #3 | 14%  2025-05-07T20:25:34.1671870Z 2025-05-07T20:25:34.2097900Z libcublas-12.6.4.1 | 256.2 MB | ########6 | 87%  2025-05-07T20:25:34.2098288Z 2025-05-07T20:25:34.2098294Z 2025-05-07T20:25:34.2098300Z 2025-05-07T20:25:34.2098305Z 2025-05-07T20:25:34.2100291Z 2025-05-07T20:25:34.2577112Z cuda-nvvp-12.6.80 | 109.3 MB | #7 | 17%  2025-05-07T20:25:34.2675442Z nsight-compute-2024. | 443.1 MB | #####6 | 56% 2025-05-07T20:25:34.2675809Z 2025-05-07T20:25:34.2675816Z 2025-05-07T20:25:34.2675821Z 2025-05-07T20:25:34.2675826Z 2025-05-07T20:25:34.2675831Z 2025-05-07T20:25:34.2675836Z 2025-05-07T20:25:34.2728970Z libcusolver-11.7.1.2 | 95.8 MB | #6 | 16%  2025-05-07T20:25:34.2732536Z 2025-05-07T20:25:34.3099866Z libcublas-12.6.4.1 | 256.2 MB | ########7 | 88%  2025-05-07T20:25:34.3100225Z 2025-05-07T20:25:34.3100231Z 2025-05-07T20:25:34.3100238Z 2025-05-07T20:25:34.3100244Z 2025-05-07T20:25:34.3101597Z 2025-05-07T20:25:34.3624339Z cuda-nvvp-12.6.80 | 109.3 MB | #9 | 20%  2025-05-07T20:25:34.3679799Z nsight-compute-2024. | 443.1 MB | #####7 | 57% 2025-05-07T20:25:34.3680076Z 2025-05-07T20:25:34.3680082Z 2025-05-07T20:25:34.3680087Z 2025-05-07T20:25:34.3680092Z 2025-05-07T20:25:34.3680099Z 2025-05-07T20:25:34.3683086Z 2025-05-07T20:25:34.3843366Z libcusolver-11.7.1.2 | 95.8 MB | #8 | 19%  2025-05-07T20:25:34.3846268Z 2025-05-07T20:25:34.4184538Z libcublas-12.6.4.1 | 256.2 MB | ########8 | 89%  2025-05-07T20:25:34.4184861Z 2025-05-07T20:25:34.4184866Z 2025-05-07T20:25:34.4184869Z 2025-05-07T20:25:34.4184873Z 2025-05-07T20:25:34.4188820Z 2025-05-07T20:25:34.4683444Z cuda-nvvp-12.6.80 | 109.3 MB | ##2 | 22%  2025-05-07T20:25:34.4683773Z 2025-05-07T20:25:34.4683776Z 2025-05-07T20:25:34.4683780Z 2025-05-07T20:25:34.4683784Z 2025-05-07T20:25:34.4683788Z 2025-05-07T20:25:34.4683792Z 2025-05-07T20:25:34.4716196Z libcusolver-11.7.1.2 | 95.8 MB | ##1 | 22%  2025-05-07T20:25:34.4870873Z nsight-compute-2024. 
| 443.1 MB | #####7 | 58% 2025-05-07T20:25:34.4871263Z 2025-05-07T20:25:34.5225454Z libcublas-12.6.4.1 | 256.2 MB | ######### | 90%  2025-05-07T20:25:34.5225820Z 2025-05-07T20:25:34.5225824Z 2025-05-07T20:25:34.5225828Z 2025-05-07T20:25:34.5225854Z 2025-05-07T20:25:34.5227118Z 2025-05-07T20:25:34.5688415Z cuda-nvvp-12.6.80 | 109.3 MB | ##4 | 25%  2025-05-07T20:25:34.5688791Z 2025-05-07T20:25:34.5688797Z 2025-05-07T20:25:34.5688802Z 2025-05-07T20:25:34.5688807Z 2025-05-07T20:25:34.5688812Z 2025-05-07T20:25:34.5688818Z 2025-05-07T20:25:34.5968737Z libcusolver-11.7.1.2 | 95.8 MB | ##4 | 25%  2025-05-07T20:25:34.5972710Z nsight-compute-2024. | 443.1 MB | #####8 | 59% 2025-05-07T20:25:34.5979161Z 2025-05-07T20:25:34.6231125Z libcublas-12.6.4.1 | 256.2 MB | #########1 | 91%  2025-05-07T20:25:34.6231398Z 2025-05-07T20:25:34.6231402Z 2025-05-07T20:25:34.6231406Z 2025-05-07T20:25:34.6231420Z 2025-05-07T20:25:34.6238872Z 2025-05-07T20:25:34.6692044Z cuda-nvvp-12.6.80 | 109.3 MB | ##7 | 27%  2025-05-07T20:25:34.6692633Z 2025-05-07T20:25:34.6692639Z 2025-05-07T20:25:34.6692643Z 2025-05-07T20:25:34.6692648Z 2025-05-07T20:25:34.6692652Z 2025-05-07T20:25:34.6692679Z 2025-05-07T20:25:34.6975286Z libcusolver-11.7.1.2 | 95.8 MB | ##7 | 28%  2025-05-07T20:25:34.6975632Z 2025-05-07T20:25:34.7088338Z libcublas-12.6.4.1 | 256.2 MB | #########2 | 92%  2025-05-07T20:25:34.7235353Z nsight-compute-2024. | 443.1 MB | #####9 | 59% 2025-05-07T20:25:34.7235711Z 2025-05-07T20:25:34.7235715Z 2025-05-07T20:25:34.7235719Z 2025-05-07T20:25:34.7235723Z 2025-05-07T20:25:34.7242963Z 2025-05-07T20:25:34.7727162Z cuda-nvvp-12.6.80 | 109.3 MB | ##9 | 30%  2025-05-07T20:25:34.7727849Z 2025-05-07T20:25:34.7727855Z 2025-05-07T20:25:34.7727861Z 2025-05-07T20:25:34.7727866Z 2025-05-07T20:25:34.7727871Z 2025-05-07T20:25:34.7728814Z 2025-05-07T20:25:34.8037509Z libcusolver-11.7.1.2 | 95.8 MB | ### | 31%  2025-05-07T20:25:34.8039450Z 2025-05-07T20:25:34.8091727Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 93%  2025-05-07T20:25:34.8238156Z nsight-compute-2024. | 443.1 MB | ###### | 60% 2025-05-07T20:25:34.8238461Z 2025-05-07T20:25:34.8238465Z 2025-05-07T20:25:34.8238469Z 2025-05-07T20:25:34.8238473Z 2025-05-07T20:25:34.8239099Z 2025-05-07T20:25:34.8812061Z cuda-nvvp-12.6.80 | 109.3 MB | ###2 | 32%  2025-05-07T20:25:34.8812407Z 2025-05-07T20:25:34.8812411Z 2025-05-07T20:25:34.8812415Z 2025-05-07T20:25:34.8812419Z 2025-05-07T20:25:34.8812430Z 2025-05-07T20:25:34.8816015Z 2025-05-07T20:25:34.9040808Z libcusolver-11.7.1.2 | 95.8 MB | ###3 | 34%  2025-05-07T20:25:34.9042557Z 2025-05-07T20:25:34.9091606Z libcublas-12.6.4.1 | 256.2 MB | #########4 | 95%  2025-05-07T20:25:34.9365187Z nsight-compute-2024. | 443.1 MB | ###### | 61% 2025-05-07T20:25:34.9365560Z 2025-05-07T20:25:34.9365566Z 2025-05-07T20:25:34.9365571Z 2025-05-07T20:25:34.9365576Z 2025-05-07T20:25:34.9368761Z 2025-05-07T20:25:34.9871806Z cuda-nvvp-12.6.80 | 109.3 MB | ###4 | 35%  2025-05-07T20:25:34.9872133Z 2025-05-07T20:25:34.9872150Z 2025-05-07T20:25:34.9872154Z 2025-05-07T20:25:34.9872158Z 2025-05-07T20:25:34.9872162Z 2025-05-07T20:25:34.9872166Z 2025-05-07T20:25:35.0044200Z libcusolver-11.7.1.2 | 95.8 MB | ###6 | 36%  2025-05-07T20:25:35.0044498Z 2025-05-07T20:25:35.0095175Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 96%  2025-05-07T20:25:35.0369167Z nsight-compute-2024. 
| 443.1 MB | ######1 | 61% 2025-05-07T20:25:35.0369441Z 2025-05-07T20:25:35.0369445Z 2025-05-07T20:25:35.0369449Z 2025-05-07T20:25:35.0369457Z 2025-05-07T20:25:35.0376432Z 2025-05-07T20:25:35.0871869Z cuda-nvvp-12.6.80 | 109.3 MB | ###7 | 37%  2025-05-07T20:25:35.0872294Z 2025-05-07T20:25:35.0872300Z 2025-05-07T20:25:35.0872305Z 2025-05-07T20:25:35.0872310Z 2025-05-07T20:25:35.0872316Z 2025-05-07T20:25:35.0873539Z 2025-05-07T20:25:35.1066484Z libcusolver-11.7.1.2 | 95.8 MB | ###9 | 39%  2025-05-07T20:25:35.1066835Z 2025-05-07T20:25:35.1181495Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 97%  2025-05-07T20:25:35.1370668Z nsight-compute-2024. | 443.1 MB | ######2 | 62% 2025-05-07T20:25:35.1371081Z 2025-05-07T20:25:35.1371087Z 2025-05-07T20:25:35.1371093Z 2025-05-07T20:25:35.1371098Z 2025-05-07T20:25:35.1372577Z 2025-05-07T20:25:35.1872808Z cuda-nvvp-12.6.80 | 109.3 MB | #### | 40%  2025-05-07T20:25:35.1873150Z 2025-05-07T20:25:35.1873156Z 2025-05-07T20:25:35.1873161Z 2025-05-07T20:25:35.1873167Z 2025-05-07T20:25:35.1873173Z 2025-05-07T20:25:35.1878252Z 2025-05-07T20:25:35.2074776Z libcusolver-11.7.1.2 | 95.8 MB | ####2 | 42%  2025-05-07T20:25:35.2075082Z 2025-05-07T20:25:35.2183977Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 98%  2025-05-07T20:25:35.2371004Z nsight-compute-2024. | 443.1 MB | ######2 | 63% 2025-05-07T20:25:35.2371441Z 2025-05-07T20:25:35.2371532Z 2025-05-07T20:25:35.2371538Z 2025-05-07T20:25:35.2371543Z 2025-05-07T20:25:35.2371667Z 2025-05-07T20:25:35.2872804Z cuda-nvvp-12.6.80 | 109.3 MB | ####2 | 43%  2025-05-07T20:25:35.2873095Z 2025-05-07T20:25:35.2873099Z 2025-05-07T20:25:35.2873103Z 2025-05-07T20:25:35.2873107Z 2025-05-07T20:25:35.2873110Z 2025-05-07T20:25:35.2878600Z 2025-05-07T20:25:35.3141017Z libcusolver-11.7.1.2 | 95.8 MB | ####5 | 46%  2025-05-07T20:25:35.3142519Z 2025-05-07T20:25:35.3184414Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 99%  2025-05-07T20:25:35.3399897Z nsight-compute-2024. | 443.1 MB | ######3 | 64% 2025-05-07T20:25:35.3400178Z 2025-05-07T20:25:35.3400183Z 2025-05-07T20:25:35.3400186Z 2025-05-07T20:25:35.3400190Z 2025-05-07T20:25:35.3400194Z 2025-05-07T20:25:35.3873794Z cuda-nvvp-12.6.80 | 109.3 MB | ####5 | 46%  2025-05-07T20:25:35.3874252Z 2025-05-07T20:25:35.3874259Z 2025-05-07T20:25:35.3874264Z 2025-05-07T20:25:35.3874270Z 2025-05-07T20:25:35.3874276Z 2025-05-07T20:25:35.3877371Z 2025-05-07T20:25:35.4188303Z libcusolver-11.7.1.2 | 95.8 MB | ####9 | 49%  2025-05-07T20:25:35.4403216Z nsight-compute-2024. | 443.1 MB | ######4 | 64% 2025-05-07T20:25:35.4403487Z 2025-05-07T20:25:35.4403560Z 2025-05-07T20:25:35.4403566Z 2025-05-07T20:25:35.4403570Z 2025-05-07T20:25:35.4408589Z 2025-05-07T20:25:35.4879043Z cuda-nvvp-12.6.80 | 109.3 MB | ####8 | 48%  2025-05-07T20:25:35.4879374Z 2025-05-07T20:25:35.4879382Z 2025-05-07T20:25:35.4879387Z 2025-05-07T20:25:35.4879392Z 2025-05-07T20:25:35.4879398Z 2025-05-07T20:25:35.4882482Z 2025-05-07T20:25:35.5215296Z libcusolver-11.7.1.2 | 95.8 MB | #####2 | 53%  2025-05-07T20:25:35.5406121Z nsight-compute-2024. 
| 443.1 MB | ######5 | 65% 2025-05-07T20:25:35.5406487Z 2025-05-07T20:25:35.5406493Z 2025-05-07T20:25:35.5406498Z 2025-05-07T20:25:35.5406503Z 2025-05-07T20:25:35.5409013Z 2025-05-07T20:25:35.5895575Z cuda-nvvp-12.6.80 | 109.3 MB | #####1 | 52%  2025-05-07T20:25:35.5895967Z 2025-05-07T20:25:35.5895973Z 2025-05-07T20:25:35.5895979Z 2025-05-07T20:25:35.5895984Z 2025-05-07T20:25:35.5895990Z 2025-05-07T20:25:35.5897485Z 2025-05-07T20:25:35.6251676Z libcusolver-11.7.1.2 | 95.8 MB | #####5 | 56%  2025-05-07T20:25:35.6411899Z nsight-compute-2024. | 443.1 MB | ######5 | 66% 2025-05-07T20:25:35.6412163Z 2025-05-07T20:25:35.6412168Z 2025-05-07T20:25:35.6412172Z 2025-05-07T20:25:35.6412175Z 2025-05-07T20:25:35.6415758Z 2025-05-07T20:25:35.6897288Z cuda-nvvp-12.6.80 | 109.3 MB | #####4 | 54%  2025-05-07T20:25:35.6897616Z 2025-05-07T20:25:35.6897622Z 2025-05-07T20:25:35.6897627Z 2025-05-07T20:25:35.6897632Z 2025-05-07T20:25:35.6897640Z 2025-05-07T20:25:35.6899187Z 2025-05-07T20:25:35.7253720Z libcusolver-11.7.1.2 | 95.8 MB | #####9 | 59%  2025-05-07T20:25:35.7417266Z nsight-compute-2024. | 443.1 MB | ######6 | 67% 2025-05-07T20:25:35.7417548Z 2025-05-07T20:25:35.7417897Z 2025-05-07T20:25:35.7417902Z 2025-05-07T20:25:35.7417906Z 2025-05-07T20:25:35.7420335Z 2025-05-07T20:25:35.7948075Z cuda-nvvp-12.6.80 | 109.3 MB | #####7 | 57%  2025-05-07T20:25:35.7948480Z 2025-05-07T20:25:35.7948485Z 2025-05-07T20:25:35.7948491Z 2025-05-07T20:25:35.7948496Z 2025-05-07T20:25:35.7948501Z 2025-05-07T20:25:35.7948506Z 2025-05-07T20:25:35.8282215Z libcusolver-11.7.1.2 | 95.8 MB | ######2 | 62%  2025-05-07T20:25:35.8469328Z nsight-compute-2024. | 443.1 MB | ######7 | 67% 2025-05-07T20:25:35.8469702Z 2025-05-07T20:25:35.8469708Z 2025-05-07T20:25:35.8469713Z 2025-05-07T20:25:35.8469718Z 2025-05-07T20:25:35.8469724Z 2025-05-07T20:25:35.8950636Z cuda-nvvp-12.6.80 | 109.3 MB | #####9 | 60%  2025-05-07T20:25:35.8951046Z 2025-05-07T20:25:35.8951052Z 2025-05-07T20:25:35.8951058Z 2025-05-07T20:25:35.8951063Z 2025-05-07T20:25:35.8951068Z 2025-05-07T20:25:35.8951073Z 2025-05-07T20:25:35.9314062Z libcusolver-11.7.1.2 | 95.8 MB | ######5 | 66%  2025-05-07T20:25:35.9474955Z nsight-compute-2024. | 443.1 MB | ######7 | 68% 2025-05-07T20:25:35.9475443Z 2025-05-07T20:25:35.9475449Z 2025-05-07T20:25:35.9475455Z 2025-05-07T20:25:35.9475460Z 2025-05-07T20:25:35.9475466Z 2025-05-07T20:25:35.9961194Z cuda-nvvp-12.6.80 | 109.3 MB | ######3 | 63%  2025-05-07T20:25:35.9961627Z 2025-05-07T20:25:35.9961633Z 2025-05-07T20:25:35.9961639Z 2025-05-07T20:25:35.9961645Z 2025-05-07T20:25:35.9961650Z 2025-05-07T20:25:35.9962768Z 2025-05-07T20:25:36.0316188Z libcusolver-11.7.1.2 | 95.8 MB | ######8 | 69%  2025-05-07T20:25:36.0476411Z nsight-compute-2024. | 443.1 MB | ######8 | 69% 2025-05-07T20:25:36.0476680Z 2025-05-07T20:25:36.0476685Z 2025-05-07T20:25:36.0476689Z 2025-05-07T20:25:36.0476693Z 2025-05-07T20:25:36.0480365Z 2025-05-07T20:25:36.0975150Z cuda-nvvp-12.6.80 | 109.3 MB | ######5 | 66%  2025-05-07T20:25:36.0976008Z 2025-05-07T20:25:36.0976017Z 2025-05-07T20:25:36.0976023Z 2025-05-07T20:25:36.0976028Z 2025-05-07T20:25:36.0976033Z 2025-05-07T20:25:36.0976038Z 2025-05-07T20:25:36.1317261Z libcusolver-11.7.1.2 | 95.8 MB | #######2 | 72%  2025-05-07T20:25:36.1572677Z nsight-compute-2024. 
| 443.1 MB | ######9 | 70% 2025-05-07T20:25:36.1572956Z 2025-05-07T20:25:36.1572961Z 2025-05-07T20:25:36.1572965Z 2025-05-07T20:25:36.1572968Z 2025-05-07T20:25:36.1575022Z 2025-05-07T20:25:36.1980941Z cuda-nvvp-12.6.80 | 109.3 MB | ######8 | 69%  2025-05-07T20:25:36.1981423Z 2025-05-07T20:25:36.1981431Z 2025-05-07T20:25:36.1981436Z 2025-05-07T20:25:36.1981442Z 2025-05-07T20:25:36.1981447Z 2025-05-07T20:25:36.1981452Z 2025-05-07T20:25:36.2318646Z libcusolver-11.7.1.2 | 95.8 MB | #######5 | 76%  2025-05-07T20:25:36.2607537Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:25:36.2607819Z 2025-05-07T20:25:36.2607854Z 2025-05-07T20:25:36.2607877Z 2025-05-07T20:25:36.2607882Z 2025-05-07T20:25:36.2610822Z 2025-05-07T20:25:36.3001305Z cuda-nvvp-12.6.80 | 109.3 MB | #######1 | 72%  2025-05-07T20:25:36.3001733Z 2025-05-07T20:25:36.3001739Z 2025-05-07T20:25:36.3001744Z 2025-05-07T20:25:36.3001749Z 2025-05-07T20:25:36.3001756Z 2025-05-07T20:25:36.3003947Z 2025-05-07T20:25:36.3457666Z libcusolver-11.7.1.2 | 95.8 MB | #######8 | 79%  2025-05-07T20:25:36.3607661Z nsight-compute-2024. | 443.1 MB | #######1 | 71% 2025-05-07T20:25:36.3608028Z 2025-05-07T20:25:36.3608034Z 2025-05-07T20:25:36.3608039Z 2025-05-07T20:25:36.3608056Z 2025-05-07T20:25:36.3610034Z 2025-05-07T20:25:36.4006126Z cuda-nvvp-12.6.80 | 109.3 MB | #######4 | 74%  2025-05-07T20:25:36.4006424Z 2025-05-07T20:25:36.4006428Z 2025-05-07T20:25:36.4006441Z 2025-05-07T20:25:36.4006446Z 2025-05-07T20:25:36.4006450Z 2025-05-07T20:25:36.4008058Z 2025-05-07T20:25:36.4584153Z libcusolver-11.7.1.2 | 95.8 MB | ########2 | 82%  2025-05-07T20:25:36.4608966Z nsight-compute-2024. | 443.1 MB | #######1 | 72% 2025-05-07T20:25:36.4609326Z 2025-05-07T20:25:36.4609332Z 2025-05-07T20:25:36.4609337Z 2025-05-07T20:25:36.4609342Z 2025-05-07T20:25:36.4610548Z 2025-05-07T20:25:36.5051464Z cuda-nvvp-12.6.80 | 109.3 MB | #######7 | 77%  2025-05-07T20:25:36.5051860Z 2025-05-07T20:25:36.5051865Z 2025-05-07T20:25:36.5051871Z 2025-05-07T20:25:36.5051876Z 2025-05-07T20:25:36.5051891Z 2025-05-07T20:25:36.5054727Z 2025-05-07T20:25:36.5584369Z libcusolver-11.7.1.2 | 95.8 MB | ########5 | 86%  2025-05-07T20:25:36.5622027Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:25:36.5622391Z 2025-05-07T20:25:36.5622397Z 2025-05-07T20:25:36.5622402Z 2025-05-07T20:25:36.5622407Z 2025-05-07T20:25:36.5622412Z 2025-05-07T20:25:36.6059008Z cuda-nvvp-12.6.80 | 109.3 MB | ######## | 80%  2025-05-07T20:25:36.6059414Z 2025-05-07T20:25:36.6059433Z 2025-05-07T20:25:36.6059438Z 2025-05-07T20:25:36.6059444Z 2025-05-07T20:25:36.6059449Z 2025-05-07T20:25:36.6062846Z 2025-05-07T20:25:36.6636461Z libcusolver-11.7.1.2 | 95.8 MB | ########8 | 89%  2025-05-07T20:25:36.6636865Z 2025-05-07T20:25:36.6636870Z 2025-05-07T20:25:36.6636876Z 2025-05-07T20:25:36.6636881Z 2025-05-07T20:25:36.6642292Z 2025-05-07T20:25:36.6673138Z cuda-nvvp-12.6.80 | 109.3 MB | ########2 | 83%  2025-05-07T20:25:36.7069111Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:25:36.7069478Z 2025-05-07T20:25:36.7069483Z 2025-05-07T20:25:36.7069489Z 2025-05-07T20:25:36.7069498Z 2025-05-07T20:25:36.7069504Z 2025-05-07T20:25:36.7071089Z 2025-05-07T20:25:36.7637349Z libcusolver-11.7.1.2 | 95.8 MB | #########2 | 92%  2025-05-07T20:25:36.7637780Z 2025-05-07T20:25:36.7637785Z 2025-05-07T20:25:36.7637791Z 2025-05-07T20:25:36.7637797Z 2025-05-07T20:25:36.7639977Z 2025-05-07T20:25:36.7756793Z cuda-nvvp-12.6.80 | 109.3 MB | ########6 | 86%  2025-05-07T20:25:36.8071558Z nsight-compute-2024. 
| 443.1 MB | #######3 | 74% 2025-05-07T20:25:36.8071929Z 2025-05-07T20:25:36.8071935Z 2025-05-07T20:25:36.8071941Z 2025-05-07T20:25:36.8071946Z 2025-05-07T20:25:36.8071951Z 2025-05-07T20:25:36.8071957Z 2025-05-07T20:25:36.8646790Z libcusolver-11.7.1.2 | 95.8 MB | #########5 | 96%  2025-05-07T20:25:36.8647206Z 2025-05-07T20:25:36.8647212Z 2025-05-07T20:25:36.8647217Z 2025-05-07T20:25:36.8647222Z 2025-05-07T20:25:36.8647228Z 2025-05-07T20:25:36.8750096Z cuda-nvvp-12.6.80 | 109.3 MB | ########8 | 89%  2025-05-07T20:25:36.8750577Z 2025-05-07T20:25:36.8750582Z 2025-05-07T20:25:36.8750587Z 2025-05-07T20:25:36.8753360Z 2025-05-07T20:25:36.8833160Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:36.9075579Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:25:36.9075966Z 2025-05-07T20:25:36.9075984Z 2025-05-07T20:25:36.9075994Z 2025-05-07T20:25:36.9075999Z 2025-05-07T20:25:36.9076005Z 2025-05-07T20:25:36.9076035Z 2025-05-07T20:25:36.9649572Z libcusolver-11.7.1.2 | 95.8 MB | #########9 | 99%  2025-05-07T20:25:36.9649895Z 2025-05-07T20:25:36.9649900Z 2025-05-07T20:25:36.9649904Z 2025-05-07T20:25:36.9649907Z 2025-05-07T20:25:36.9649911Z 2025-05-07T20:25:36.9836232Z cuda-nvvp-12.6.80 | 109.3 MB | #########2 | 92%  2025-05-07T20:25:37.0650622Z nsight-compute-2024. | 443.1 MB | #######5 | 75% 2025-05-07T20:25:37.0651014Z 2025-05-07T20:25:37.0651020Z 2025-05-07T20:25:37.0651025Z 2025-05-07T20:25:37.0651042Z 2025-05-07T20:25:37.0651048Z 2025-05-07T20:25:37.0839122Z cuda-nvvp-12.6.80 | 109.3 MB | #########5 | 96%  2025-05-07T20:25:37.1658254Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:25:37.1658516Z 2025-05-07T20:25:37.1658648Z 2025-05-07T20:25:37.1658652Z 2025-05-07T20:25:37.1658921Z 2025-05-07T20:25:37.1662452Z 2025-05-07T20:25:37.1840736Z cuda-nvvp-12.6.80 | 109.3 MB | #########9 | 100%  2025-05-07T20:25:37.2841831Z nsight-compute-2024. | 443.1 MB | #######7 | 77% 2025-05-07T20:25:37.3845397Z nsight-compute-2024. | 443.1 MB | #######8 | 78% 2025-05-07T20:25:37.4848172Z nsight-compute-2024. | 443.1 MB | #######9 | 79% 2025-05-07T20:25:37.5319416Z nsight-compute-2024. | 443.1 MB | ######## | 80% 2025-05-07T20:25:37.5319746Z 2025-05-07T20:25:37.5323163Z 2025-05-07T20:25:37.5849654Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:25:37.5878770Z nsight-compute-2024. | 443.1 MB | ########1 | 81% 2025-05-07T20:25:37.5879028Z 2025-05-07T20:25:37.5879032Z 2025-05-07T20:25:37.5879035Z 2025-05-07T20:25:37.5879039Z 2025-05-07T20:25:37.5879043Z 2025-05-07T20:25:37.5879047Z 2025-05-07T20:25:37.5879051Z 2025-05-07T20:25:37.6883180Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:25:37.6883483Z 2025-05-07T20:25:37.6883487Z 2025-05-07T20:25:37.6883491Z 2025-05-07T20:25:37.6883494Z 2025-05-07T20:25:37.6883498Z 2025-05-07T20:25:37.6883502Z 2025-05-07T20:25:37.6883630Z 2025-05-07T20:25:37.7005099Z libnpp-12.3.1.54 | 93.4 MB | 3 | 4%  2025-05-07T20:25:37.7883619Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:25:37.7883983Z 2025-05-07T20:25:37.7883988Z 2025-05-07T20:25:37.7883994Z 2025-05-07T20:25:37.7883999Z 2025-05-07T20:25:37.7884005Z 2025-05-07T20:25:37.7884010Z 2025-05-07T20:25:37.7887501Z 2025-05-07T20:25:37.8269126Z libnpp-12.3.1.54 | 93.4 MB | 7 | 7%  2025-05-07T20:25:37.8885302Z nsight-compute-2024. 
| 443.1 MB | ########3 | 83% 2025-05-07T20:25:37.8885663Z 2025-05-07T20:25:37.8885668Z 2025-05-07T20:25:37.8885674Z 2025-05-07T20:25:37.8885679Z 2025-05-07T20:25:37.8885685Z 2025-05-07T20:25:37.8885690Z 2025-05-07T20:25:37.8889963Z 2025-05-07T20:25:37.9315599Z libnpp-12.3.1.54 | 93.4 MB | #1 | 11%  2025-05-07T20:25:37.9886333Z nsight-compute-2024. | 443.1 MB | ########4 | 84% 2025-05-07T20:25:37.9886695Z 2025-05-07T20:25:37.9886700Z 2025-05-07T20:25:37.9886706Z 2025-05-07T20:25:37.9886711Z 2025-05-07T20:25:37.9886716Z 2025-05-07T20:25:37.9886721Z 2025-05-07T20:25:37.9888618Z 2025-05-07T20:25:38.0475406Z libnpp-12.3.1.54 | 93.4 MB | #4 | 15%  2025-05-07T20:25:38.0887312Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:25:38.0887643Z 2025-05-07T20:25:38.0887649Z 2025-05-07T20:25:38.0887654Z 2025-05-07T20:25:38.0887659Z 2025-05-07T20:25:38.0887665Z 2025-05-07T20:25:38.0887670Z 2025-05-07T20:25:38.0889264Z 2025-05-07T20:25:38.1548290Z libnpp-12.3.1.54 | 93.4 MB | #8 | 19%  2025-05-07T20:25:38.1933643Z nsight-compute-2024. | 443.1 MB | ########5 | 86% 2025-05-07T20:25:38.1933967Z 2025-05-07T20:25:38.1933984Z 2025-05-07T20:25:38.1933998Z 2025-05-07T20:25:38.1934003Z 2025-05-07T20:25:38.1934008Z 2025-05-07T20:25:38.1934013Z 2025-05-07T20:25:38.1934219Z 2025-05-07T20:25:38.2929284Z libnpp-12.3.1.54 | 93.4 MB | ##2 | 22%  2025-05-07T20:25:38.2937542Z nsight-compute-2024. | 443.1 MB | ########6 | 87% 2025-05-07T20:25:38.2937872Z 2025-05-07T20:25:38.2937877Z 2025-05-07T20:25:38.2937882Z 2025-05-07T20:25:38.2937888Z 2025-05-07T20:25:38.2937893Z 2025-05-07T20:25:38.2937907Z 2025-05-07T20:25:38.2940494Z 2025-05-07T20:25:38.3939867Z libnpp-12.3.1.54 | 93.4 MB | ##6 | 27%  2025-05-07T20:25:38.3940452Z 2025-05-07T20:25:38.3940466Z 2025-05-07T20:25:38.3940472Z 2025-05-07T20:25:38.3940478Z 2025-05-07T20:25:38.3940483Z 2025-05-07T20:25:38.3940489Z 2025-05-07T20:25:38.3945415Z 2025-05-07T20:25:38.4133129Z libnpp-12.3.1.54 | 93.4 MB | ### | 31%  2025-05-07T20:25:38.4940264Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:25:38.4940832Z 2025-05-07T20:25:38.4940836Z 2025-05-07T20:25:38.4940839Z 2025-05-07T20:25:38.4940843Z 2025-05-07T20:25:38.4940854Z 2025-05-07T20:25:38.4940858Z 2025-05-07T20:25:38.4942645Z 2025-05-07T20:25:38.5136265Z libnpp-12.3.1.54 | 93.4 MB | ###4 | 35%  2025-05-07T20:25:38.6018416Z nsight-compute-2024. | 443.1 MB | ########8 | 88% 2025-05-07T20:25:38.6018771Z 2025-05-07T20:25:38.6018777Z 2025-05-07T20:25:38.6018783Z 2025-05-07T20:25:38.6018788Z 2025-05-07T20:25:38.6018793Z 2025-05-07T20:25:38.6018799Z 2025-05-07T20:25:38.6020549Z 2025-05-07T20:25:38.6140603Z libnpp-12.3.1.54 | 93.4 MB | ###8 | 39%  2025-05-07T20:25:38.7126616Z nsight-compute-2024. | 443.1 MB | ########9 | 89% 2025-05-07T20:25:38.7126873Z 2025-05-07T20:25:38.7126877Z 2025-05-07T20:25:38.7126880Z 2025-05-07T20:25:38.7126884Z 2025-05-07T20:25:38.7126887Z 2025-05-07T20:25:38.7126891Z 2025-05-07T20:25:38.7128807Z 2025-05-07T20:25:38.7140966Z libnpp-12.3.1.54 | 93.4 MB | ####2 | 43%  2025-05-07T20:25:38.8132733Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:25:38.8133001Z 2025-05-07T20:25:38.8133005Z 2025-05-07T20:25:38.8133009Z 2025-05-07T20:25:38.8133013Z 2025-05-07T20:25:38.8133017Z 2025-05-07T20:25:38.8133020Z 2025-05-07T20:25:38.8134779Z 2025-05-07T20:25:38.8142213Z libnpp-12.3.1.54 | 93.4 MB | ####6 | 47%  2025-05-07T20:25:38.9142906Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:25:38.9145390Z nsight-compute-2024. 
| 443.1 MB | #########1 | 92% 2025-05-07T20:25:38.9145702Z 2025-05-07T20:25:38.9145707Z 2025-05-07T20:25:38.9145712Z 2025-05-07T20:25:38.9145717Z 2025-05-07T20:25:38.9145723Z 2025-05-07T20:25:38.9145728Z 2025-05-07T20:25:38.9145733Z 2025-05-07T20:25:39.0145132Z libnpp-12.3.1.54 | 93.4 MB | ##### | 50%  2025-05-07T20:25:39.0187202Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:25:39.0187539Z 2025-05-07T20:25:39.0187543Z 2025-05-07T20:25:39.0187547Z 2025-05-07T20:25:39.0187551Z 2025-05-07T20:25:39.0187554Z 2025-05-07T20:25:39.0187558Z 2025-05-07T20:25:39.0187562Z 2025-05-07T20:25:39.1145711Z libnpp-12.3.1.54 | 93.4 MB | #####4 | 54%  2025-05-07T20:25:39.1239579Z nsight-compute-2024. | 443.1 MB | #########3 | 93% 2025-05-07T20:25:39.1239932Z 2025-05-07T20:25:39.1239990Z 2025-05-07T20:25:39.1240027Z 2025-05-07T20:25:39.1240033Z 2025-05-07T20:25:39.1240038Z 2025-05-07T20:25:39.1240043Z 2025-05-07T20:25:39.1240445Z 2025-05-07T20:25:39.2154015Z libnpp-12.3.1.54 | 93.4 MB | #####7 | 58%  2025-05-07T20:25:39.2246156Z nsight-compute-2024. | 443.1 MB | #########4 | 94% 2025-05-07T20:25:39.2246536Z 2025-05-07T20:25:39.2246542Z 2025-05-07T20:25:39.2246548Z 2025-05-07T20:25:39.2246553Z 2025-05-07T20:25:39.2246559Z 2025-05-07T20:25:39.2246586Z 2025-05-07T20:25:39.2249073Z 2025-05-07T20:25:39.3188479Z libnpp-12.3.1.54 | 93.4 MB | ######1 | 62%  2025-05-07T20:25:39.4191050Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:25:39.4293812Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:25:39.4294167Z 2025-05-07T20:25:39.4294171Z 2025-05-07T20:25:39.4294185Z 2025-05-07T20:25:39.4294189Z 2025-05-07T20:25:39.4294193Z 2025-05-07T20:25:39.4294196Z 2025-05-07T20:25:39.4297442Z 2025-05-07T20:25:39.5213131Z libnpp-12.3.1.54 | 93.4 MB | ######5 | 65%  2025-05-07T20:25:39.5295179Z nsight-compute-2024. | 443.1 MB | #########6 | 97% 2025-05-07T20:25:39.5295428Z 2025-05-07T20:25:39.5295432Z 2025-05-07T20:25:39.5295436Z 2025-05-07T20:25:39.5295440Z 2025-05-07T20:25:39.5295443Z 2025-05-07T20:25:39.5295455Z 2025-05-07T20:25:39.5297090Z 2025-05-07T20:25:39.6248297Z libnpp-12.3.1.54 | 93.4 MB | ######8 | 69%  2025-05-07T20:25:39.6299202Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:25:39.6299748Z 2025-05-07T20:25:39.6299753Z 2025-05-07T20:25:39.6299756Z 2025-05-07T20:25:39.6299760Z 2025-05-07T20:25:39.6299764Z 2025-05-07T20:25:39.6299767Z 2025-05-07T20:25:39.6300707Z 2025-05-07T20:25:39.7300796Z libnpp-12.3.1.54 | 93.4 MB | #######2 | 73%  2025-05-07T20:25:39.7301235Z 2025-05-07T20:25:39.7301239Z 2025-05-07T20:25:39.7301243Z 2025-05-07T20:25:39.7301246Z 2025-05-07T20:25:39.7301258Z 2025-05-07T20:25:39.7301261Z 2025-05-07T20:25:39.7302206Z 2025-05-07T20:25:39.7304754Z libnpp-12.3.1.54 | 93.4 MB | #######6 | 77%  2025-05-07T20:25:39.7941050Z nsight-compute-2024. | 443.1 MB | #########8 | 98% 2025-05-07T20:25:39.7941555Z 2025-05-07T20:25:39.7941560Z 2025-05-07T20:25:39.7941564Z 2025-05-07T20:25:39.7941567Z 2025-05-07T20:25:39.7941571Z 2025-05-07T20:25:39.7951644Z 2025-05-07T20:25:39.8303764Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:25:39.8304195Z 2025-05-07T20:25:39.8304201Z 2025-05-07T20:25:39.8304206Z 2025-05-07T20:25:39.8304211Z 2025-05-07T20:25:39.8304217Z 2025-05-07T20:25:39.8304222Z 2025-05-07T20:25:39.8305257Z 2025-05-07T20:25:39.8387221Z libnpp-12.3.1.54 | 93.4 MB | ######## | 80%  2025-05-07T20:25:39.8701397Z nsight-compute-2024. 
| 443.1 MB | #########9 | 99% 2025-05-07T20:25:39.8701771Z 2025-05-07T20:25:39.8701777Z 2025-05-07T20:25:39.8701782Z 2025-05-07T20:25:39.8701787Z 2025-05-07T20:25:39.8701792Z 2025-05-07T20:25:39.8701798Z 2025-05-07T20:25:39.8701803Z 2025-05-07T20:25:39.8702142Z 2025-05-07T20:25:39.9408657Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:25:39.9409101Z 2025-05-07T20:25:39.9409108Z 2025-05-07T20:25:39.9409113Z 2025-05-07T20:25:39.9409118Z 2025-05-07T20:25:39.9409124Z 2025-05-07T20:25:39.9409130Z 2025-05-07T20:25:39.9410208Z 2025-05-07T20:25:39.9707307Z libnpp-12.3.1.54 | 93.4 MB | ########3 | 84%  2025-05-07T20:25:39.9707752Z 2025-05-07T20:25:39.9707758Z 2025-05-07T20:25:39.9707764Z 2025-05-07T20:25:39.9707769Z 2025-05-07T20:25:39.9707775Z 2025-05-07T20:25:39.9707780Z 2025-05-07T20:25:39.9707785Z 2025-05-07T20:25:39.9707791Z 2025-05-07T20:25:40.0552692Z cuda-nvdisasm-12.6.7 | 47.6 MB | 5 | 6%  2025-05-07T20:25:40.0553096Z 2025-05-07T20:25:40.0553100Z 2025-05-07T20:25:40.0553104Z 2025-05-07T20:25:40.0553108Z 2025-05-07T20:25:40.0553111Z 2025-05-07T20:25:40.0553115Z 2025-05-07T20:25:40.0566117Z 2025-05-07T20:25:40.0710907Z libnpp-12.3.1.54 | 93.4 MB | ########7 | 87%  2025-05-07T20:25:40.0711295Z 2025-05-07T20:25:40.0711301Z 2025-05-07T20:25:40.0711306Z 2025-05-07T20:25:40.0711312Z 2025-05-07T20:25:40.0711316Z 2025-05-07T20:25:40.0711322Z 2025-05-07T20:25:40.0711337Z 2025-05-07T20:25:40.0715923Z 2025-05-07T20:25:40.1579915Z cuda-nvdisasm-12.6.7 | 47.6 MB | #2 | 12%  2025-05-07T20:25:40.1580363Z 2025-05-07T20:25:40.1580369Z 2025-05-07T20:25:40.1580387Z 2025-05-07T20:25:40.1580393Z 2025-05-07T20:25:40.1580398Z 2025-05-07T20:25:40.1580403Z 2025-05-07T20:25:40.1582443Z 2025-05-07T20:25:40.1711749Z libnpp-12.3.1.54 | 93.4 MB | ######### | 91%  2025-05-07T20:25:40.1712164Z 2025-05-07T20:25:40.1712170Z 2025-05-07T20:25:40.1712175Z 2025-05-07T20:25:40.1712181Z 2025-05-07T20:25:40.1712186Z 2025-05-07T20:25:40.1712191Z 2025-05-07T20:25:40.1712196Z 2025-05-07T20:25:40.1712202Z 2025-05-07T20:25:40.2611460Z cuda-nvdisasm-12.6.7 | 47.6 MB | #9 | 19%  2025-05-07T20:25:40.2611893Z 2025-05-07T20:25:40.2611898Z 2025-05-07T20:25:40.2611904Z 2025-05-07T20:25:40.2611909Z 2025-05-07T20:25:40.2611928Z 2025-05-07T20:25:40.2611933Z 2025-05-07T20:25:40.2615474Z 2025-05-07T20:25:40.2743142Z libnpp-12.3.1.54 | 93.4 MB | #########4 | 94%  2025-05-07T20:25:40.2743788Z 2025-05-07T20:25:40.2743793Z 2025-05-07T20:25:40.2743799Z 2025-05-07T20:25:40.2743804Z 2025-05-07T20:25:40.2743809Z 2025-05-07T20:25:40.2743814Z 2025-05-07T20:25:40.2743820Z 2025-05-07T20:25:40.2743830Z 2025-05-07T20:25:40.3615119Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##5 | 26%  2025-05-07T20:25:40.3615558Z 2025-05-07T20:25:40.3615564Z 2025-05-07T20:25:40.3615569Z 2025-05-07T20:25:40.3615574Z 2025-05-07T20:25:40.3615580Z 2025-05-07T20:25:40.3615585Z 2025-05-07T20:25:40.3621556Z 2025-05-07T20:25:40.3818623Z libnpp-12.3.1.54 | 93.4 MB | #########7 | 98%  2025-05-07T20:25:40.3819023Z 2025-05-07T20:25:40.3819028Z 2025-05-07T20:25:40.3819034Z 2025-05-07T20:25:40.3819039Z 2025-05-07T20:25:40.3819044Z 2025-05-07T20:25:40.3819050Z 2025-05-07T20:25:40.3819055Z 2025-05-07T20:25:40.3820407Z 2025-05-07T20:25:40.4823163Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###1 | 32%  2025-05-07T20:25:40.4823544Z 2025-05-07T20:25:40.4823551Z 2025-05-07T20:25:40.4823556Z 2025-05-07T20:25:40.4823561Z 2025-05-07T20:25:40.4823566Z 2025-05-07T20:25:40.4823572Z 2025-05-07T20:25:40.4823577Z 2025-05-07T20:25:40.4823582Z 2025-05-07T20:25:40.5484843Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###9 | 39%  
2025-05-07T20:25:40.6097538Z cuda-nvvp-12.6.80    | 109.3 MB | ########## | 100%
2025-05-07T20:25:43.0118019Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:25:43.0685777Z cuda-nvdisasm-12.6.7 | 47.6 MB  | ########## | 100%
2025-05-07T20:25:43.2686418Z libcurand-10.3.7.77  | 39.9 MB  | ########## | 100%
2025-05-07T20:25:43.4792919Z libcublas-12.6.4.1   | 256.2 MB | ########## | 100%
2025-05-07T20:25:43.5210608Z libnpp-12.3.1.54     | 93.4 MB  | ########## | 100%
2025-05-07T20:25:44.9129241Z cuda-nvcc-tools-12.6 | 23.0 MB  | ########## | 100%
2025-05-07T20:25:44.9328870Z cuda-nvrtc-12.6.85   | 17.3 MB  | ########## | 100%
2025-05-07T20:25:45.2518860Z python-3.9.18        | 22.7 MB  | ########## | 100%
2025-05-07T20:25:45.7238408Z libcufft-11.3.0.4    | 156.2 MB | ########## | 100%
2025-05-07T20:25:45.7739155Z cuda-nvcc-dev_linux- | 10.8 MB  | ########## | 100%
2025-05-07T20:25:45.9356744Z gds-tools-1.11.1.6   | 37.8 MB  | ########## | 100%
2025-05-07T20:25:45.9588226Z libnvjitlink-12.6.85 | 14.9 MB  | ########## | 100%
2025-05-07T20:25:45.9739314Z ... (more hidden) ...
2025-05-07T20:25:46.1598570Z cuda-nvvm-tools-12.6 | 10.4 MB  | ########## | 100%
2025-05-07T20:25:46.4772288Z cuda-sanitizer-api-1 | 8.9 MB   | ########## | 100%
2025-05-07T20:25:46.8224647Z cuda-nvvm-impl-12.6. | 7.7 MB   | ########## | 100%
2025-05-07T20:25:47.0376764Z libcusolver-11.7.1.2 | 95.8 MB  | ########## | 100%
2025-05-07T20:25:49.0337484Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:25:57.0770510Z 2025-05-07T20:25:57.0770523Z 2025-05-07T20:25:57.0770529Z 2025-05-07T20:25:57.0770533Z 2025-05-07T20:25:57.0770539Z 2025-05-07T20:25:57.0770722Z  2025-05-07T20:25:57.0770962Z 2025-05-07T20:25:57.0770967Z 2025-05-07T20:25:57.0770972Z 2025-05-07T20:25:57.0770986Z 2025-05-07T20:25:57.0770991Z 2025-05-07T20:25:57.0771099Z 2025-05-07T20:25:57.0771105Z 2025-05-07T20:25:57.0771110Z 2025-05-07T20:25:57.0771115Z 2025-05-07T20:25:57.0771120Z 2025-05-07T20:25:57.0771126Z 2025-05-07T20:25:57.0771131Z 2025-05-07T20:25:57.0771317Z  2025-05-07T20:25:57.0771575Z 2025-05-07T20:25:57.0771580Z 2025-05-07T20:25:57.0771585Z 2025-05-07T20:25:57.0771590Z 2025-05-07T20:25:57.0771596Z 2025-05-07T20:25:57.0771601Z 2025-05-07T20:25:57.0771606Z 2025-05-07T20:25:57.0771611Z 2025-05-07T20:25:57.0771616Z 2025-05-07T20:25:57.0771621Z 2025-05-07T20:25:57.0771626Z 2025-05-07T20:25:57.0771631Z 2025-05-07T20:25:57.0771637Z 2025-05-07T20:25:57.0771824Z  2025-05-07T20:25:57.0772095Z 2025-05-07T20:25:57.0772100Z 2025-05-07T20:25:57.0772105Z 2025-05-07T20:25:57.0772110Z 2025-05-07T20:25:57.0772116Z 2025-05-07T20:25:57.0772121Z 2025-05-07T20:25:57.0772126Z 2025-05-07T20:25:57.0772131Z 2025-05-07T20:25:57.0772136Z 2025-05-07T20:25:57.0772148Z 2025-05-07T20:25:57.0772163Z 2025-05-07T20:25:57.0772168Z 2025-05-07T20:25:57.0772173Z 2025-05-07T20:25:57.0772179Z 2025-05-07T20:25:57.0772413Z  2025-05-07T20:25:57.0772683Z 2025-05-07T20:25:57.0772689Z 2025-05-07T20:25:57.0772694Z 2025-05-07T20:25:57.0772707Z 2025-05-07T20:25:57.0772713Z 2025-05-07T20:25:57.0772718Z 2025-05-07T20:25:57.0772723Z 2025-05-07T20:25:57.0772728Z 2025-05-07T20:25:57.0772733Z 2025-05-07T20:25:57.0772739Z 2025-05-07T20:25:57.0772745Z 2025-05-07T20:25:57.0772750Z 2025-05-07T20:25:57.0772756Z 2025-05-07T20:25:57.0772761Z 2025-05-07T20:25:57.0772767Z 2025-05-07T20:25:57.0772976Z  2025-05-07T20:25:57.0773270Z 2025-05-07T20:25:57.0773276Z 2025-05-07T20:25:57.0773281Z 2025-05-07T20:25:57.0773287Z 2025-05-07T20:25:57.0773292Z 2025-05-07T20:25:57.0773297Z 2025-05-07T20:25:57.0773302Z 2025-05-07T20:25:57.0773308Z 2025-05-07T20:25:57.0773313Z 2025-05-07T20:25:57.0773327Z 2025-05-07T20:25:57.0773452Z 2025-05-07T20:25:57.0773457Z 2025-05-07T20:25:57.0773462Z 2025-05-07T20:25:57.0773468Z 2025-05-07T20:25:57.0773473Z 2025-05-07T20:25:57.0773478Z 2025-05-07T20:25:57.0773739Z  2025-05-07T20:25:57.0774050Z 2025-05-07T20:25:57.0774055Z 2025-05-07T20:25:57.0774060Z 2025-05-07T20:25:57.0774065Z 2025-05-07T20:25:57.0774070Z 2025-05-07T20:25:57.0774076Z 2025-05-07T20:25:57.0774081Z 2025-05-07T20:25:57.0774086Z 2025-05-07T20:25:57.0774092Z 2025-05-07T20:25:57.0774097Z 2025-05-07T20:25:57.0774114Z 2025-05-07T20:25:57.0774119Z 2025-05-07T20:25:57.0774124Z 2025-05-07T20:25:57.0774130Z 2025-05-07T20:25:57.0774135Z 2025-05-07T20:25:57.0774140Z 2025-05-07T20:25:57.0774145Z 2025-05-07T20:25:57.0774376Z  2025-05-07T20:25:57.0774678Z 2025-05-07T20:25:57.0774683Z 2025-05-07T20:25:57.0774689Z 2025-05-07T20:25:57.0774695Z 2025-05-07T20:25:57.0774700Z 2025-05-07T20:25:57.0774712Z 2025-05-07T20:25:57.0774725Z 2025-05-07T20:25:57.0774731Z 2025-05-07T20:25:57.0774736Z 2025-05-07T20:25:57.0774740Z 2025-05-07T20:25:57.0774745Z 2025-05-07T20:25:57.0774750Z 2025-05-07T20:25:57.0774755Z 2025-05-07T20:25:57.0774760Z 2025-05-07T20:25:57.0774765Z 2025-05-07T20:25:57.0774770Z 2025-05-07T20:25:57.0774775Z 2025-05-07T20:25:57.0774780Z 2025-05-07T20:25:57.0775017Z  2025-05-07T20:25:57.0775309Z 2025-05-07T20:25:57.0775315Z 2025-05-07T20:25:57.0775456Z  2025-05-07T20:25:57.0775615Z 
2025-05-07T20:25:57.0775621Z 2025-05-07T20:25:57.0775763Z  2025-05-07T20:25:57.0775911Z 2025-05-07T20:25:57.0775924Z 2025-05-07T20:25:57.0775929Z 2025-05-07T20:25:57.0776076Z  2025-05-07T20:25:57.0776229Z 2025-05-07T20:25:57.0776234Z 2025-05-07T20:25:57.0776239Z 2025-05-07T20:25:57.0776244Z 2025-05-07T20:25:57.0776434Z  2025-05-07T20:25:57.0776597Z 2025-05-07T20:25:57.0776603Z 2025-05-07T20:25:57.0776608Z 2025-05-07T20:25:57.0776734Z 2025-05-07T20:25:57.0776740Z 2025-05-07T20:25:57.0776911Z  2025-05-07T20:25:57.0777088Z 2025-05-07T20:25:57.0777093Z 2025-05-07T20:25:57.0777099Z 2025-05-07T20:25:57.0777104Z 2025-05-07T20:25:57.0777109Z 2025-05-07T20:25:57.0777114Z 2025-05-07T20:25:57.0777287Z  2025-05-07T20:25:57.0777466Z 2025-05-07T20:25:57.0777472Z 2025-05-07T20:25:57.0777477Z 2025-05-07T20:25:57.0777482Z 2025-05-07T20:25:57.0777487Z 2025-05-07T20:25:57.0777492Z 2025-05-07T20:25:57.0777497Z 2025-05-07T20:25:57.0777671Z  2025-05-07T20:25:57.0777862Z 2025-05-07T20:25:57.0777867Z 2025-05-07T20:25:57.0777872Z 2025-05-07T20:25:57.0777878Z 2025-05-07T20:25:57.0777883Z 2025-05-07T20:25:57.0777888Z 2025-05-07T20:25:57.0777893Z 2025-05-07T20:25:57.0777898Z 2025-05-07T20:25:57.0778082Z  2025-05-07T20:25:57.0778290Z 2025-05-07T20:25:57.0778296Z 2025-05-07T20:25:57.0778301Z 2025-05-07T20:25:57.0778306Z 2025-05-07T20:25:57.0778322Z 2025-05-07T20:25:57.0778332Z 2025-05-07T20:25:57.0778337Z 2025-05-07T20:25:57.0778342Z 2025-05-07T20:25:57.0778355Z 2025-05-07T20:25:57.0778529Z  2025-05-07T20:25:57.0778745Z 2025-05-07T20:25:57.0778750Z 2025-05-07T20:25:57.0778756Z 2025-05-07T20:25:57.0778761Z 2025-05-07T20:25:57.0778766Z 2025-05-07T20:25:57.0778771Z 2025-05-07T20:25:57.0778784Z 2025-05-07T20:25:57.0778789Z 2025-05-07T20:25:57.0778794Z 2025-05-07T20:25:57.0778800Z 2025-05-07T20:25:57.0778973Z  2025-05-07T20:25:57.0779199Z 2025-05-07T20:25:57.0779204Z 2025-05-07T20:25:57.0779209Z 2025-05-07T20:25:57.0779215Z 2025-05-07T20:25:57.0779227Z 2025-05-07T20:25:57.0779232Z 2025-05-07T20:25:57.0779237Z 2025-05-07T20:25:57.0779243Z 2025-05-07T20:25:57.0779248Z 2025-05-07T20:25:57.0779253Z 2025-05-07T20:25:57.0779258Z 2025-05-07T20:25:57.0779436Z  2025-05-07T20:25:57.0779693Z 2025-05-07T20:25:57.0779699Z 2025-05-07T20:25:57.0779710Z 2025-05-07T20:25:57.0779824Z 2025-05-07T20:25:57.0779829Z 2025-05-07T20:25:57.0779835Z 2025-05-07T20:25:57.0779840Z 2025-05-07T20:25:57.0779845Z 2025-05-07T20:25:57.0779850Z 2025-05-07T20:25:57.0779856Z 2025-05-07T20:25:57.0779861Z 2025-05-07T20:25:57.0779866Z 2025-05-07T20:25:57.0780061Z  2025-05-07T20:25:57.0780323Z 2025-05-07T20:25:57.0780328Z 2025-05-07T20:25:57.0780333Z 2025-05-07T20:25:57.0780339Z 2025-05-07T20:25:57.0780344Z 2025-05-07T20:25:57.0780349Z 2025-05-07T20:25:57.0780354Z 2025-05-07T20:25:57.0780359Z 2025-05-07T20:25:57.0780364Z 2025-05-07T20:25:57.0780369Z 2025-05-07T20:25:57.0780375Z 2025-05-07T20:25:57.0780380Z 2025-05-07T20:25:57.0780385Z 2025-05-07T20:25:57.0780611Z  2025-05-07T20:25:57.0780875Z 2025-05-07T20:25:57.0780879Z 2025-05-07T20:25:57.0780884Z 2025-05-07T20:25:57.0780890Z 2025-05-07T20:25:57.0780895Z 2025-05-07T20:25:57.0780900Z 2025-05-07T20:25:57.0780905Z 2025-05-07T20:25:57.0780925Z 2025-05-07T20:25:57.0780937Z 2025-05-07T20:25:57.0780942Z 2025-05-07T20:25:57.0780948Z 2025-05-07T20:25:57.0780953Z 2025-05-07T20:25:57.0780958Z 2025-05-07T20:25:57.0780963Z 2025-05-07T20:25:57.0781290Z  2025-05-07T20:25:57.0781574Z 2025-05-07T20:25:57.0781580Z 2025-05-07T20:25:57.0781585Z 2025-05-07T20:25:57.0781590Z 2025-05-07T20:25:57.0781596Z 2025-05-07T20:25:57.0781601Z 
2025-05-07T20:25:57.0781606Z 2025-05-07T20:25:57.0781611Z 2025-05-07T20:25:57.0781616Z 2025-05-07T20:25:57.0781621Z 2025-05-07T20:25:57.0781627Z 2025-05-07T20:25:57.0781632Z 2025-05-07T20:25:57.0781637Z 2025-05-07T20:25:57.0781642Z 2025-05-07T20:25:57.0781647Z 2025-05-07T20:25:57.0781859Z  2025-05-07T20:25:57.0782147Z 2025-05-07T20:25:57.0782152Z 2025-05-07T20:25:57.0782157Z 2025-05-07T20:25:57.0782162Z 2025-05-07T20:25:57.0782167Z 2025-05-07T20:25:57.0782173Z 2025-05-07T20:25:57.0782178Z 2025-05-07T20:25:57.0782300Z 2025-05-07T20:25:57.0782314Z 2025-05-07T20:25:57.0782320Z 2025-05-07T20:25:57.0782325Z 2025-05-07T20:25:57.0782330Z 2025-05-07T20:25:57.0782336Z 2025-05-07T20:25:57.0782341Z 2025-05-07T20:25:57.0782346Z 2025-05-07T20:25:57.0782351Z 2025-05-07T20:25:57.0782580Z  2025-05-07T20:25:57.0782868Z 2025-05-07T20:25:57.0782873Z 2025-05-07T20:25:57.0782878Z 2025-05-07T20:25:57.0782884Z 2025-05-07T20:25:57.0782889Z 2025-05-07T20:25:57.0782894Z 2025-05-07T20:25:57.0782899Z 2025-05-07T20:25:57.0782911Z 2025-05-07T20:25:57.0782917Z 2025-05-07T20:25:57.0782922Z 2025-05-07T20:25:57.0782927Z 2025-05-07T20:25:57.0782933Z 2025-05-07T20:25:57.0782937Z 2025-05-07T20:25:57.0782943Z 2025-05-07T20:25:57.0782948Z 2025-05-07T20:25:57.0782953Z 2025-05-07T20:25:57.0782958Z 2025-05-07T20:25:57.0783177Z  2025-05-07T20:25:57.0783474Z 2025-05-07T20:25:57.0783480Z 2025-05-07T20:25:57.0783485Z 2025-05-07T20:25:57.0783496Z 2025-05-07T20:25:57.0783506Z 2025-05-07T20:25:57.0783511Z 2025-05-07T20:25:57.0783516Z 2025-05-07T20:25:57.0783521Z 2025-05-07T20:25:57.0783526Z 2025-05-07T20:25:57.0783531Z 2025-05-07T20:25:57.0783537Z 2025-05-07T20:25:57.0783542Z 2025-05-07T20:25:57.0783549Z 2025-05-07T20:25:57.0783555Z 2025-05-07T20:25:57.0783562Z 2025-05-07T20:25:57.0783569Z 2025-05-07T20:25:57.0783575Z 2025-05-07T20:25:57.0783582Z 2025-05-07T20:25:57.0783856Z  2025-05-07T20:25:57.0784153Z 2025-05-07T20:25:57.0784158Z 2025-05-07T20:25:57.0784295Z  2025-05-07T20:25:57.0784444Z 2025-05-07T20:25:57.0784449Z 2025-05-07T20:25:57.0784588Z  2025-05-07T20:25:57.0784744Z 2025-05-07T20:25:57.0784749Z 2025-05-07T20:25:57.0784754Z 2025-05-07T20:25:57.0784902Z  2025-05-07T20:25:57.0785053Z 2025-05-07T20:25:57.0785059Z 2025-05-07T20:25:57.0785064Z 2025-05-07T20:25:57.0785084Z 2025-05-07T20:25:57.0785268Z  2025-05-07T20:25:57.0785446Z 2025-05-07T20:25:57.0785554Z 2025-05-07T20:25:57.0785559Z 2025-05-07T20:25:57.0785564Z 2025-05-07T20:25:57.0785570Z 2025-05-07T20:25:57.0785725Z  2025-05-07T20:25:57.0785906Z 2025-05-07T20:25:57.0785911Z 2025-05-07T20:25:57.0785916Z 2025-05-07T20:25:57.0785921Z 2025-05-07T20:25:57.0785927Z 2025-05-07T20:25:57.0785932Z 2025-05-07T20:25:57.0786087Z  2025-05-07T20:25:57.0786269Z 2025-05-07T20:25:57.0786274Z 2025-05-07T20:25:57.0786279Z 2025-05-07T20:25:57.0786284Z 2025-05-07T20:25:57.0786289Z 2025-05-07T20:25:57.0786294Z 2025-05-07T20:25:57.0786300Z 2025-05-07T20:25:57.0786463Z  2025-05-07T20:25:57.0786663Z 2025-05-07T20:25:57.0786669Z 2025-05-07T20:25:57.0786674Z 2025-05-07T20:25:57.0786679Z 2025-05-07T20:25:57.0786684Z 2025-05-07T20:25:57.0786690Z 2025-05-07T20:25:57.0786695Z 2025-05-07T20:25:57.0786700Z 2025-05-07T20:25:57.0786864Z  2025-05-07T20:25:57.0787087Z 2025-05-07T20:25:57.0787092Z 2025-05-07T20:25:57.0787101Z 2025-05-07T20:25:57.0787115Z 2025-05-07T20:25:57.0787120Z 2025-05-07T20:25:57.0787125Z 2025-05-07T20:25:57.0787130Z 2025-05-07T20:25:57.0787135Z 2025-05-07T20:25:57.0787141Z 2025-05-07T20:25:57.0787313Z  2025-05-07T20:25:57.0787538Z 2025-05-07T20:25:57.0787544Z 2025-05-07T20:25:57.0787549Z 
2025-05-07T20:25:57.0787554Z 2025-05-07T20:25:57.0787559Z 2025-05-07T20:25:57.0787564Z 2025-05-07T20:25:57.0787569Z 2025-05-07T20:25:57.0787574Z 2025-05-07T20:25:57.0787580Z 2025-05-07T20:25:57.0787585Z 2025-05-07T20:25:57.0787758Z  2025-05-07T20:25:57.0787993Z 2025-05-07T20:25:57.0787998Z 2025-05-07T20:25:57.0788003Z 2025-05-07T20:25:57.0788008Z 2025-05-07T20:25:57.0788013Z 2025-05-07T20:25:57.0788019Z 2025-05-07T20:25:57.0788024Z 2025-05-07T20:25:57.0788029Z 2025-05-07T20:25:57.0788034Z 2025-05-07T20:25:57.0788040Z 2025-05-07T20:25:57.0788045Z 2025-05-07T20:25:57.0788236Z  2025-05-07T20:25:57.0788608Z 2025-05-07T20:25:57.0788621Z 2025-05-07T20:25:57.0788626Z 2025-05-07T20:25:57.0788631Z 2025-05-07T20:25:57.0788637Z 2025-05-07T20:25:57.0788642Z 2025-05-07T20:25:57.0788647Z 2025-05-07T20:25:57.0788652Z 2025-05-07T20:25:57.0788657Z 2025-05-07T20:25:57.0788662Z 2025-05-07T20:25:57.0788667Z 2025-05-07T20:25:57.0788673Z 2025-05-07T20:25:57.0788886Z  2025-05-07T20:25:57.0789133Z 2025-05-07T20:25:57.0789139Z 2025-05-07T20:25:57.0789144Z 2025-05-07T20:25:57.0789149Z 2025-05-07T20:25:57.0789154Z 2025-05-07T20:25:57.0789160Z 2025-05-07T20:25:57.0789164Z 2025-05-07T20:25:57.0789170Z 2025-05-07T20:25:57.0789175Z 2025-05-07T20:25:57.0789180Z 2025-05-07T20:25:57.0789185Z 2025-05-07T20:25:57.0789200Z 2025-05-07T20:25:57.0789205Z 2025-05-07T20:25:57.0789396Z  2025-05-07T20:25:57.0789656Z 2025-05-07T20:25:57.0789661Z 2025-05-07T20:25:57.0789666Z 2025-05-07T20:25:57.0789671Z 2025-05-07T20:25:57.0789676Z 2025-05-07T20:25:57.0789700Z 2025-05-07T20:25:57.0789705Z 2025-05-07T20:25:57.0789710Z 2025-05-07T20:25:57.0789716Z 2025-05-07T20:25:57.0789721Z 2025-05-07T20:25:57.0789726Z 2025-05-07T20:25:57.0789731Z 2025-05-07T20:25:57.0789737Z 2025-05-07T20:25:57.0789742Z 2025-05-07T20:25:57.0789940Z  2025-05-07T20:25:57.0790143Z 2025-05-07T20:25:57.0790147Z 2025-05-07T20:25:57.0790151Z 2025-05-07T20:25:57.0790154Z 2025-05-07T20:25:57.0790158Z 2025-05-07T20:25:57.0790162Z 2025-05-07T20:25:57.0790165Z 2025-05-07T20:25:57.0790169Z 2025-05-07T20:25:57.0790172Z 2025-05-07T20:25:57.0790176Z 2025-05-07T20:25:57.0790180Z 2025-05-07T20:25:57.0790183Z 2025-05-07T20:25:57.0790187Z 2025-05-07T20:25:57.0790190Z 2025-05-07T20:25:57.0790194Z 2025-05-07T20:25:57.0790347Z  2025-05-07T20:25:57.0790547Z 2025-05-07T20:25:57.0790551Z 2025-05-07T20:25:57.0790554Z 2025-05-07T20:25:57.0790558Z 2025-05-07T20:25:57.0790562Z 2025-05-07T20:25:57.0790570Z 2025-05-07T20:25:57.0790663Z 2025-05-07T20:25:57.0790667Z 2025-05-07T20:25:57.0790670Z 2025-05-07T20:25:57.0790674Z 2025-05-07T20:25:57.0790677Z 2025-05-07T20:25:57.0790681Z 2025-05-07T20:25:57.0790685Z 2025-05-07T20:25:57.0790688Z 2025-05-07T20:25:57.0790699Z 2025-05-07T20:25:57.0790703Z 2025-05-07T20:25:57.0790860Z  2025-05-07T20:25:57.0791067Z 2025-05-07T20:25:57.0791071Z 2025-05-07T20:25:57.0791075Z 2025-05-07T20:25:57.0791078Z 2025-05-07T20:25:57.0791082Z 2025-05-07T20:25:57.0791092Z 2025-05-07T20:25:57.0791096Z 2025-05-07T20:25:57.0791099Z 2025-05-07T20:25:57.0791103Z 2025-05-07T20:25:57.0791106Z 2025-05-07T20:25:57.0791110Z 2025-05-07T20:25:57.0791114Z 2025-05-07T20:25:57.0791117Z 2025-05-07T20:25:57.0791121Z 2025-05-07T20:25:57.0791124Z 2025-05-07T20:25:57.0791128Z 2025-05-07T20:25:57.0791132Z 2025-05-07T20:25:57.0791315Z  2025-05-07T20:25:57.0791522Z 2025-05-07T20:25:57.0791534Z 2025-05-07T20:25:57.0791537Z 2025-05-07T20:25:57.0791541Z 2025-05-07T20:25:57.0791544Z 2025-05-07T20:25:57.0791548Z 2025-05-07T20:25:57.0791552Z 2025-05-07T20:25:57.0791555Z 2025-05-07T20:25:57.0791559Z 
2025-05-07T20:25:57.0795773Z done 2025-05-07T20:25:57.3985879Z Preparing transaction: done 2025-05-07T20:25:58.8331624Z Verifying transaction: done 2025-05-07T20:25:59.6589967Z Executing transaction: done 2025-05-07T20:26:02.0140503Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:02.0141070Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:02.0141966Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:02.0142553Z 2025-05-07T20:26:02.0156002Z 2025-05-07T20:26:02.0157112Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:02.0157926Z 2025-05-07T20:26:02.0170272Z 2025-05-07T20:26:02.0170677Z [INSTALL] Copying nvtx3 headers ... 2025-05-07T20:26:02.0175794Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/ 2025-05-07T20:26:02.0179605Z 2025-05-07T20:26:02.0384494Z 2025-05-07T20:26:02.0390019Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/ 2025-05-07T20:26:02.0393776Z 2025-05-07T20:26:02.0411509Z 2025-05-07T20:26:02.0411932Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ... 2025-05-07T20:26:02.0791130Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ... 2025-05-07T20:26:03.9950362Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error) 2025-05-07T20:26:04.0624070Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs 2025-05-07T20:26:04.0624690Z 2025-05-07T20:26:04.4854073Z 2025-05-07T20:26:04.4862500Z [INSTALL] Setting environment variable NVML_LIB_PATH ... 
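The [INSTALL] fix-up above can be reproduced outside CI. A minimal sketch, assuming CONDA_PREFIX points at the build env; the loop and the nvtx3 variable are illustrative, not from the log:

    # Recreate the unversioned libnvToolsExt.so symlinks that CUDA 12.6+ packages
    # no longer ship, in both lib directories the toolchain searches.
    for libdir in "$CONDA_PREFIX/lib" "$CONDA_PREFIX/targets/x86_64-linux/lib"; do
      [ -f "$libdir/libnvToolsExt.so.1" ] && ln -sf "$libdir/libnvToolsExt.so.1" "$libdir/libnvToolsExt.so"
    done
    # nsight-compute bundles the nvtx3 headers; copy them next to the standard
    # CUDA headers so legacy '#include <nvToolsExt.h>' still resolves.
    nvtx3=$(ls -d "$CONDA_PREFIX"/nsight-compute-*/host/target-linux-x64/nvtx/include/nvtx3 | head -n1)
    cp -r "$nvtx3"/* "$CONDA_PREFIX/include/"
    cp -r "$nvtx3"/* "$CONDA_PREFIX/targets/x86_64-linux/include/"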
2025-05-07T20:26:04.5216131Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so 2025-05-07T20:26:04.5216637Z 2025-05-07T20:26:04.9636307Z 2025-05-07T20:26:04.9636633Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ... 2025-05-07T20:26:04.9637554Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/" 2025-05-07T20:26:04.9638269Z 2025-05-07T20:26:05.3891476Z 2025-05-07T20:26:07.4178656Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h 2025-05-07T20:26:09.4456852Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so 2025-05-07T20:26:11.4686588Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:11.4687411Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:13.4993050Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so 2025-05-07T20:26:15.3885092Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc 2025-05-07T20:26:15.3885383Z 2025-05-07T20:26:15.4521315Z [CHECK] Binary nvcc found in PATH 2025-05-07T20:26:19.3265838Z /tmp/tmp2em5kp8a: line 3: clang: command not found 2025-05-07T20:26:19.3266203Z 2025-05-07T20:26:19.3266953Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error) 2025-05-07T20:26:19.3913614Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d 2025-05-07T20:26:19.3914133Z 2025-05-07T20:26:19.3934940Z total 36 2025-05-07T20:26:19.3935250Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:25 . 2025-05-07T20:26:19.3935642Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:24 .. 2025-05-07T20:26:19.3936089Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh 2025-05-07T20:26:19.3936608Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh 2025-05-07T20:26:19.3937254Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh 2025-05-07T20:26:19.3937855Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh 2025-05-07T20:26:19.3938359Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh 2025-05-07T20:26:19.3938820Z -rw-r--r--. 2 ec2-user ec2-user 2932 Nov 20 20:32 ~cuda-nvcc_activate.sh 2025-05-07T20:26:19.3939109Z 2025-05-07T20:26:19.3939334Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ... 2025-05-07T20:26:19.3939974Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh 2025-05-07T20:26:19.3940638Z 2025-05-07T20:26:19.3960467Z 2025-05-07T20:26:19.3960781Z + conda run -n build_binary c++ --version | grep -i clang 2025-05-07T20:26:19.3961122Z 2025-05-07T20:26:21.3570600Z 2025-05-07T20:26:21.3571271Z [BUILD] Setting prepend flags for NVCC ... 
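The two host-compiler tweaks around this point work together: the sed above strips the -ccbin= line (which pins nvcc to the conda-provided $CXX) from the cuda-nvcc activation hook, and the records that follow persist NVCC_PREPEND_FLAGS so nvcc tolerates a host compiler it does not officially support. A minimal sketch of the same pair of steps, assuming the env is named build_binary as in the log:

    # Drop the -ccbin= pin from the nvcc activation hook (filename as listed above).
    sed -i '/-ccbin=/d' "$CONDA_PREFIX"/etc/conda/activate.d/*cuda-nvcc_activate.sh
    # Persist the flag in the env; 'conda env config vars set' stores it in the
    # env's state so it is re-exported on every 'conda activate' / 'conda run'.
    conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
    # Round-trip check. Note that printenv exits non-zero when the variable is
    # unset, which is likely why the earlier LD_LIBRARY_PATH probe logged a
    # 'conda run ... failed' error before the variable was first set.
    conda run -n build_binary printenv NVCC_PREPEND_FLAGS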
2025-05-07T20:26:21.3571843Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler" 2025-05-07T20:26:21.3572226Z 2025-05-07T20:26:21.7909547Z 2025-05-07T20:26:21.7909953Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS 2025-05-07T20:26:21.7910328Z 2025-05-07T20:26:23.6979050Z -allow-unsupported-compiler 2025-05-07T20:26:23.6979317Z 2025-05-07T20:26:23.7637898Z 2025-05-07T20:26:23.7638708Z [INFO] Printing out all preprocessor defines in nvcc ... 2025-05-07T20:26:23.7639736Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:23.7640742Z 2025-05-07T20:26:25.7282746Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:25.7283854Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:25.7284198Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:25.7284523Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:25.7284858Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:25.7285136Z #define _STL_PAIR_H 1 2025-05-07T20:26:25.7285388Z #define __cpp_attributes 200809L 2025-05-07T20:26:25.7285726Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:25.7286087Z #define __DELETE_THROW throw() 2025-05-07T20:26:25.7286351Z #define _PTRDIFF_T_ 2025-05-07T20:26:25.7286611Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:25.7286907Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:25.7287178Z #define _IO_LEFT 02 2025-05-07T20:26:25.7287415Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:25.7287681Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:25.7287956Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:25.7288406Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:25.7288844Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:25.7289132Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:25.7289383Z #define _IOS_OUTPUT 2 2025-05-07T20:26:25.7289687Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:25.7290061Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:25.7290369Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:25.7290654Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:25.7290954Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:25.7291737Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:25.7292778Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:25.7293096Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:25.7293571Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:25.7293897Z #define _T_WCHAR_ 2025-05-07T20:26:25.7294123Z #define stdout stdout 2025-05-07T20:26:25.7294465Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:25.7294850Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:25.7295107Z #define __flexarr [] 2025-05-07T20:26:25.7295359Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:25.7295694Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:25.7296044Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:25.7296310Z #define _MATH_H 1 2025-05-07T20:26:25.7296649Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:25.7297143Z #define __S64_TYPE long int 2025-05-07T20:26:25.7297506Z #define 
__stub_fchflags 2025-05-07T20:26:25.7297881Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:25.7298277Z #define __SQUAD_TYPE long int 2025-05-07T20:26:25.7298652Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:25.7299033Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:25.7299405Z #define NL_NMAX INT_MAX 2025-05-07T20:26:25.7299731Z #define _BITS_TIME_H 1 2025-05-07T20:26:25.7300127Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:25.7300522Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:25.7300829Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:25.7301289Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:25.7301698Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:25.7302069Z #define __CHAR_BIT__ 8 2025-05-07T20:26:25.7302338Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:25.7302666Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:25.7302962Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:25.7303237Z #define FP_NAN 0 2025-05-07T20:26:25.7303505Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:25.7303954Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:25.7304571Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:25.7304964Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:25.7305258Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:25.7305525Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:25.7305786Z #define __SM_80_RT_H__ 2025-05-07T20:26:25.7306022Z #define _NEW 2025-05-07T20:26:25.7306249Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:25.7306541Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:25.7306919Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:25.7307321Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:25.7307574Z #define __USE_ANSI 1 2025-05-07T20:26:25.7307869Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:25.7308271Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:25.7308630Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:25.7308937Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:25.7309231Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:25.7309523Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:25.7309813Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:25.7310107Z #define PIPE_BUF 4096 2025-05-07T20:26:25.7310434Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:25.7310809Z #define ADJ_TICK 0x4000 2025-05-07T20:26:25.7311094Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:25.7311417Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:25.7311693Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:25.7312022Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:25.7312489Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:25.7313017Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:25.7313390Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:25.7313652Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:25.7314014Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:25.7314316Z #define __cpp_static_assert 201411L 2025-05-07T20:26:25.7314662Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:25.7315058Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:25.7315345Z #define _POSIX_TTY_NAME_MAX 
9 2025-05-07T20:26:25.7315636Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:25.7315947Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:25.7316240Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:25.7316548Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:25.7316914Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:25.7317259Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:25.7317549Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:25.7317871Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:25.7318238Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:25.7318599Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:25.7318908Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:25.7319210Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:25.7319543Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:25.7319876Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:25.7320284Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:25.7320708Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:25.7321025Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:25.7321313Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:25.7321598Z #define __GCC_IEC_559 2 2025-05-07T20:26:25.7321901Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:25.7322251Z #define _IO_flockfile(_fp) 2025-05-07T20:26:25.7322518Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:25.7322798Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:25.7323075Z #define _IOFBF 0 2025-05-07T20:26:25.7323293Z #define __USE_BSD 1 2025-05-07T20:26:25.7323532Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:25.7323913Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:25.7324194Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:25.7324458Z #define _IO_NO_WRITES 8 2025-05-07T20:26:25.7324725Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:25.7325082Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:25.7325447Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:25.7325772Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:25.7326109Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:25.7326406Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:25.7326689Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:25.7326966Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:25.7327285Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:25.7327677Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:25.7328050Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:25.7328359Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:25.7328688Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:25.7329025Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:25.7329331Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:25.7329645Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:25.7329927Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:25.7330201Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:25.7330782Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:25.7331370Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:25.7331703Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:25.7332029Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:25.7332340Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:25.7332623Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:25.7332891Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:25.7333293Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:25.7333637Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:25.7333948Z #define RAND_MAX 2147483647 2025-05-07T20:26:25.7334215Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:25.7334551Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:25.7334873Z #define __SM_90_RT_H__ 2025-05-07T20:26:25.7335119Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:25.7335389Z #define __COMPAR_FN_T 2025-05-07T20:26:25.7335642Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:25.7335910Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:25.7336396Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:25.7336912Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:25.7337260Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:25.7337633Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:25.7337966Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:25.7349614Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:25.7349960Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:25.7350490Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:25.7351037Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:25.7351380Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:25.7351664Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:25.7351970Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:25.7352282Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:25.7352572Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:25.7352892Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:25.7353156Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:25.7353413Z #define __u_char_defined 2025-05-07T20:26:25.7353732Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:25.7354088Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:25.7354355Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:25.7354621Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:25.7355102Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:25.7355544Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:25.7355971Z #define FP_INFINITE 1 2025-05-07T20:26:25.7356343Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:25.7356764Z #define _IO_pid_t __pid_t 2025-05-07T20:26:25.7357025Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:25.7357292Z #define __LEAF , __leaf__ 2025-05-07T20:26:25.7357531Z #define PATH_MAX 4096 2025-05-07T20:26:25.7357801Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:25.7358136Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:25.7358461Z #define _LIMITS_H___ 2025-05-07T20:26:25.7358693Z #define __size_t 2025-05-07T20:26:25.7358924Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:25.7359474Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:25.7360041Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:25.7360351Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:25.7360676Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:25.7360942Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:25.7361302Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:25.7361695Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:25.7361997Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:25.7362327Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:25.7362609Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:25.7362894Z #define __INT8_C(c) c 2025-05-07T20:26:25.7363159Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:25.7363460Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:25.7363720Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:25.7363986Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:25.7364237Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:25.7364659Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:25.7364993Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:25.7365322Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:25.7365591Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:25.7365871Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:25.7366140Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:25.7366453Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:25.7366760Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:25.7367126Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:25.7367499Z #define NFDBITS __NFDBITS 2025-05-07T20:26:25.7367764Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:25.7368056Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:25.7368377Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:25.7368693Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:25.7368955Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:25.7369253Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:25.7369561Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:25.7369877Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:25.7370295Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:25.7370652Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:25.7370951Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:25.7371273Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:25.7371640Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:25.7371985Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:25.7372309Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:25.7372646Z #define __daddr_t_defined 2025-05-07T20:26:25.7372895Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:25.7373173Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:25.7373494Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:25.7374009Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:25.7374580Z #define _ACRTIMP 2025-05-07T20:26:25.7374807Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:25.7375072Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:25.7375368Z #define _IOS_BIN 128 2025-05-07T20:26:25.7375724Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:25.7376139Z #define __FLT64X_HAS_QUIET_NAN__ 1 
2025-05-07T20:26:25.7376405Z #define UNDERFLOW 4 2025-05-07T20:26:25.7376628Z #define NAME_MAX 255 2025-05-07T20:26:25.7376867Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:25.7377132Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:25.7377414Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:25.7377710Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:25.7378080Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:25.7378469Z #define __ptr_t void * 2025-05-07T20:26:25.7378715Z #define M_E 2.7182818284590452354 2025-05-07T20:26:25.7379001Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:25.7379274Z #define __USE_ISOCXX11 1 2025-05-07T20:26:25.7379546Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:25.7379859Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:25.7380158Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:25.7380444Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:25.7380737Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:25.7381049Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:25.7381369Z #define __linux 1 2025-05-07T20:26:25.7381600Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:25.7381871Z #define cudaDeviceMask 0xff 2025-05-07T20:26:25.7382142Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:25.7382441Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:25.7382744Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:25.7383064Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:25.7383460Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:25.7383769Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:25.7384064Z #define _BITS_TYPES_H 1 2025-05-07T20:26:25.7384357Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:25.7384691Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:25.7385002Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:25.7385284Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:25.7385576Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:25.7385861Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:25.7386640Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:25.7387453Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:25.7387733Z #define __unix 1 2025-05-07T20:26:25.7387952Z #define MATH_ERRNO 1 2025-05-07T20:26:25.7388198Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:25.7388479Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:25.7388753Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:25.7389042Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:25.7389332Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:25.7389616Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:25.7390084Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:25.7390548Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:25.7390840Z #define CUDARTAPI_CDECL 2025-05-07T20:26:25.7391100Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:25.7391377Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:25.7391660Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:25.7391929Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:25.7392169Z #define __SIZE_T 2025-05-07T20:26:25.7392416Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:25.7392738Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 
0 2025-05-07T20:26:25.7393041Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:25.7393383Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:25.7393648Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:25.7394037Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:25.7394465Z #define __WAIT_STATUS void * 2025-05-07T20:26:25.7394724Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:25.7394992Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:25.7395263Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:25.7395545Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:25.7395823Z #define __WINT_MIN__ 0U 2025-05-07T20:26:25.7396399Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:25.7397042Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:25.7397343Z #define WUNTRACED 2 2025-05-07T20:26:25.7397575Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:25.7397860Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:25.7398146Z #define NZERO 20 2025-05-07T20:26:25.7398378Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:25.7398663Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:25.7398951Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:25.7399239Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:25.7399501Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:25.7399780Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:25.7400060Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:25.7400340Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:25.7400611Z #define EXIT_FAILURE 1 2025-05-07T20:26:25.7400853Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:25.7401120Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:25.7401383Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:25.7401640Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:25.7401924Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:25.7402259Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:25.7402703Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:25.7403009Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:25.7403264Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:25.7403532Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:25.7403828Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:25.7404139Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:25.7404424Z #define SEEK_DATA 3 2025-05-07T20:26:25.7404658Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:25.7404955Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:25.7405370Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:25.7405760Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:25.7406011Z #define __INT64_C(c) c ## L 2025-05-07T20:26:25.7406282Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:25.7406613Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:25.7406945Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:25.7407229Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:25.7407533Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:25.7407832Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:25.7408094Z #define __INT_WCHAR_T_H 2025-05-07T20:26:25.7408338Z #define WSTOPPED 2 2025-05-07T20:26:25.7408572Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:25.7408861Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:25.7409119Z #define FP_NORMAL 4 
2025-05-07T20:26:25.7409361Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:25.7409656Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:25.7409898Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:25.7410155Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:25.7410450Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:25.7410727Z #define cudaTextureType1D 0x01 2025-05-07T20:26:25.7411000Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:25.7411266Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:25.7411536Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:25.7411834Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:25.7412346Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:25.7412824Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:25.7413117Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:25.7413378Z #define _POSIX_SOURCE 1 2025-05-07T20:26:25.7413630Z #define cudaTextureType2D 0x02 2025-05-07T20:26:25.7413900Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:25.7414170Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:25.7414491Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:25.7414762Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:25.7415080Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:25.7415417Z #define cudaTextureType3D 0x03 2025-05-07T20:26:25.7415691Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:25.7415948Z #define CLOCK_REALTIME 0 2025-05-07T20:26:25.7416205Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:25.7416484Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:25.7416792Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:25.7417078Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:25.7417359Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:25.7417651Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:25.7417919Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:25.7418226Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:25.7418523Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:25.7418800Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:25.7419060Z #define __GLIBC__ 2 2025-05-07T20:26:25.7419282Z #define __END_DECLS } 2025-05-07T20:26:25.7419517Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:25.7419889Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:25.7420269Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:25.7420518Z #define WCONTINUED 8 2025-05-07T20:26:25.7420756Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:25.7421016Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:25.7421343Z #define _ALLOCA_H 1 2025-05-07T20:26:25.7421659Z #define __host__ __location__(host) 2025-05-07T20:26:25.7422089Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:25.7422523Z #define __SLONG32_TYPE int 2025-05-07T20:26:25.7422811Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:25.7423125Z #define _SYS_SELECT_H 1 2025-05-07T20:26:25.7423369Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:25.7423616Z #define _IOS_NOCREATE 32 2025-05-07T20:26:25.7423870Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:25.7424156Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:25.7424447Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:25.7424737Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:25.7425025Z #define __global__ __location__(global) 2025-05-07T20:26:25.7425315Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:25.7425576Z #define 
__cpp_decltype_auto 201304L 2025-05-07T20:26:25.7425853Z #define __DBL_DIG__ 15 2025-05-07T20:26:25.7426094Z #define TIME_UTC 1 2025-05-07T20:26:25.7426320Z #define __FLT32_DIG__ 6 2025-05-07T20:26:25.7426651Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:25.7427049Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:25.7427362Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:25.7427677Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:25.7427976Z #define _G_BUFSIZ 8192 2025-05-07T20:26:25.7428277Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:25.7428649Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:25.7428951Z #define __cudaCDP2GetDevice 2025-05-07T20:26:25.7429229Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:25.7429522Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:25.7429774Z #define __GXX_WEAK__ 1 2025-05-07T20:26:25.7430024Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.7430334Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:25.7430599Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:25.7430900Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:25.7431347Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:25.7431631Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:25.7431920Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:25.7432215Z #define _G_config_h 1 2025-05-07T20:26:25.7432496Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:25.7432836Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:25.7433110Z #define _GCC_WCHAR_T 2025-05-07T20:26:25.7433344Z #define TMP_MAX 238328 2025-05-07T20:26:25.7433588Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:25.7433850Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:25.7434114Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:25.7434397Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:25.7434680Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:25.7434960Z #define _IO_SKIPWS 01 2025-05-07T20:26:25.7435366Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:25.7435833Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:25.7436102Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:25.7436442Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:25.7436811Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:25.7437176Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:25.7437543Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:25.7437802Z #define le32toh(x) (x) 2025-05-07T20:26:25.7438033Z #define _SIZE_T_DEFINED 2025-05-07T20:26:25.7438292Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:25.7438636Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:25.7438996Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:25.7439392Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:25.7439806Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:25.7440392Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:25.7440707Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:25.7441107Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:25.7441399Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:25.7441929Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:25.7442430Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:25.7442743Z 
#define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:25.7443089Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:25.7443407Z #define _WCHAR_T_ 2025-05-07T20:26:25.7443639Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:25.7444002Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:25.7444382Z #define RTSIG_MAX 32 2025-05-07T20:26:25.7444610Z #define _STDDEF_H 2025-05-07T20:26:25.7444845Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:25.7445115Z #define _VA_LIST_DEFINED 2025-05-07T20:26:25.7445374Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:25.7445717Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:25.7446108Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:25.7446444Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:25.7446739Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:25.7447204Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:25.7447737Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:25.7448111Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:25.7448435Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:25.7448745Z #define __unix__ 1 2025-05-07T20:26:25.7448988Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.7449276Z #define __INT_WIDTH__ 32 2025-05-07T20:26:25.7449521Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:25.7449762Z #define _IONBF 2 2025-05-07T20:26:25.7450208Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:25.7450966Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:25.7451637Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:25.7451904Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:25.7452177Z #define __UINT16_C(c) c 2025-05-07T20:26:25.7452417Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:25.7452735Z #define STA_DEL 0x0020 2025-05-07T20:26:25.7452996Z #define __CUDACC_VER_MINOR__ 6 2025-05-07T20:26:25.7453252Z #define __id_t_defined 2025-05-07T20:26:25.7453531Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:25.7453984Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:25.7454415Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:25.7454691Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:25.7454958Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:25.7455209Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:25.7455483Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:25.7455765Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:25.7456039Z #define SING 2 2025-05-07T20:26:25.7456259Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:25.7456531Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:25.7456836Z #define cudaStreamDefault 0x00 2025-05-07T20:26:25.7457180Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:25.7457552Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:25.7457829Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:25.7458096Z #define __gnu_linux__ 1 2025-05-07T20:26:25.7458337Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:25.7458598Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:25.7458842Z #define MAX_INPUT 255 2025-05-07T20:26:25.7459086Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:25.7459419Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:25.7459788Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:25.7460188Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:25.7460519Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:25.7460928Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:25.7461409Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:25.7461745Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:25.7462105Z #define _Mfloat_ float 2025-05-07T20:26:25.7462364Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:25.7462678Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:25.7462968Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:25.7463454Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:25.7463947Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:25.7464226Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:25.7464558Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:25.7464911Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:25.7465222Z #define __USE_ISOC11 1 2025-05-07T20:26:25.7465462Z #define _BSD_SIZE_T_ 2025-05-07T20:26:25.7465692Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:25.7465945Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:25.7466214Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:25.7466509Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:25.7466840Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:25.7467158Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:25.7467486Z #define __THROW throw () 2025-05-07T20:26:25.7467741Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:25.7468033Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:25.7468394Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:25.7468746Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:25.7469024Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:25.7475995Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:25.7476297Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:25.7476577Z #define L_tmpnam 20 2025-05-07T20:26:25.7476912Z #define ___int_wchar_t_h 2025-05-07T20:26:25.7477257Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:25.7477644Z #define isascii(c) __isascii (c) 2025-05-07T20:26:25.7477908Z #define _T_PTRDIFF 2025-05-07T20:26:25.7478216Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:25.7478574Z #define toascii(c) __toascii (c) 2025-05-07T20:26:25.7478837Z #define __GNUC__ 11 2025-05-07T20:26:25.7479085Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:25.7479393Z #define __GXX_RTTI 1 2025-05-07T20:26:25.7479620Z #define __pie__ 2 2025-05-07T20:26:25.7479830Z #define __MMX__ 1 2025-05-07T20:26:25.7480051Z #define __cudaCDP2Malloc 2025-05-07T20:26:25.7480309Z #define __timespec_defined 1 2025-05-07T20:26:25.7480557Z #define L_ctermid 9 2025-05-07T20:26:25.7480792Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:25.7481099Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:25.7481501Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:25.7481883Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:25.7482153Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:25.7482446Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:25.7482796Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:25.7483113Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:25.7483375Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:25.7483812Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:25.7484559Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:25.7485158Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:25.7485461Z #define __USE_SVID 1 2025-05-07T20:26:25.7485710Z #define __constant__ __location__(constant) 2025-05-07T20:26:25.7486016Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:25.7486402Z #define __device__ __location__(device) 2025-05-07T20:26:25.7486730Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:25.7487050Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:25.7487316Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:25.7487596Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:25.7487944Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:25.7488313Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:25.7488595Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:25.7488960Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:25.7489333Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:25.7489579Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:25.7489947Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:25.7490362Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:25.7490677Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:25.7490953Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:25.7491219Z #define NGROUPS_MAX 65536 2025-05-07T20:26:25.7491471Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:25.7491733Z #define __USE_ISOC95 1 2025-05-07T20:26:25.7491952Z #define _TIME_H 1 2025-05-07T20:26:25.7492218Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:25.7492537Z #define __USE_ISOC99 1 2025-05-07T20:26:25.7492857Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:25.7493216Z #define HOST_NAME_MAX 64 2025-05-07T20:26:25.7493462Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:25.7493722Z #define _IOS_ATEND 4 2025-05-07T20:26:25.7493952Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:25.7494275Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:25.7494675Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:25.7495015Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:25.7495297Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:25.7495625Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:25.7496081Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:25.7496339Z #define _STDIO_H 1 2025-05-07T20:26:25.7496735Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:25.7497198Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:25.7497555Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:25.7497931Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:25.7498220Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:25.7498482Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:25.7498752Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:25.7499042Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:25.7499339Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:25.7499653Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:25.7499927Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:25.7500203Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:25.7500519Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:25.7500791Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:25.7501322Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:25.7501675Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:25.7502039Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:25.7502285Z #define __USE_XOPEN 1 2025-05-07T20:26:25.7502523Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:25.7502959Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:25.7503394Z #define __USE_XOPEN2K 1 2025-05-07T20:26:25.7503630Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:25.7503897Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:25.7504188Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:25.7504453Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:25.7504971Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:25.7505581Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:25.7505869Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:25.7506222Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:25.7506604Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:25.7506988Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:25.7507372Z #define __END_NAMESPACE_C99 2025-05-07T20:26:25.7507641Z #define __glibcxx_integral_traps true 2025-05-07T20:26:25.7507924Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:25.7508173Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:25.7508431Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:25.7508700Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:25.7508945Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:25.7509237Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:25.7509534Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:25.7509894Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:25.7510278Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:25.7510554Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:25.7510821Z #define _IO_UNITBUF 020000 2025-05-07T20:26:25.7511070Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:25.7511329Z #define __FD_SETSIZE 1024 2025-05-07T20:26:25.7511582Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:25.7511849Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:25.7512190Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:25.7512545Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:25.7512853Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:25.7513167Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:25.7513483Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:25.7513752Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:25.7514051Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:25.7514386Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:25.7514677Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:25.7515117Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:25.7515402Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:25.7515671Z #define __USE_POSIX199506 1 2025-05-07T20:26:25.7515915Z #define _FEATURES_H 1 2025-05-07T20:26:25.7516153Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:25.7516547Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:25.7516959Z #define __stub_getmsg 2025-05-07T20:26:25.7517191Z #define _IO_FIXED 010000 2025-05-07T20:26:25.7517463Z #define __cpp_lib_addressof_constexpr 201603 
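Just above, the dump spells out how libstdc++ derives numeric_limits-style constants purely in the preprocessor: __glibcxx_digits_b(T,B) is B minus one sign bit when T is signed (the companion __glibcxx_signed_b(T,B) ((T)(-1) < 0) appears further down in the dump), and __glibcxx_max_b builds the all-ones-except-sign pattern in two steps so no intermediate shift overflows. A worked check under those definitions; the my_* names are local stand-ins, not the library's:

/* glibcxx_limits_demo.cu -- re-derives the __glibcxx_*_b arithmetic shown above */
#include <climits>

#define my_signed_b(T,B) ((T)(-1) < 0)
#define my_digits_b(T,B) (B - my_signed_b(T,B))
#define my_max_b(T,B)    (my_signed_b(T,B) \
    ? (((((T)1 << (my_digits_b(T,B) - 1)) - 1) << 1) + 1) : ~(T)0)
#define my_min_b(T,B)    (my_signed_b(T,B) ? -my_max_b(T,B) - 1 : (T)0)

/* For T=int, B=32: signed=1, digits=31,
   max = ((((1<<30) - 1) << 1) + 1) = 0x7fffffff -- note no UB from 1<<31. */
static_assert(my_max_b(int, 32) == INT_MAX, "max matches <climits>");
static_assert(my_min_b(int, 32) == INT_MIN, "min matches <climits>");
static_assert(my_max_b(unsigned, 32) == UINT_MAX, "unsigned: all bits set");
/* __glibcxx_digits10_b: 31 * 643 / 2136 = 9, since 643/2136 ~ log10(2) */
static_assert(my_digits_b(int, 32) * 643L / 2136 == 9, "digits10 of int");

int main() { return 0; }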
2025-05-07T20:26:25.7517767Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:25.7518037Z #define __stub_setlogin 2025-05-07T20:26:25.7518273Z #define __stub_fattach 2025-05-07T20:26:25.7518512Z #define __cplusplus 201703L 2025-05-07T20:26:25.7518774Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:25.7519052Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:25.7519306Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:25.7519588Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:25.7520072Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:25.7520596Z #define _IO_INTERNAL 010 2025-05-07T20:26:25.7520835Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:25.7521170Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:25.7521521Z #define __dev_t_defined 2025-05-07T20:26:25.7521752Z #define __DEPRECATED 1 2025-05-07T20:26:25.7521980Z #define __S32_TYPE int 2025-05-07T20:26:25.7522226Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:25.7522515Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:25.7522778Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:25.7523039Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:25.7523632Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:25.7524254Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:25.7524684Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:25.7525032Z #define OVERFLOW 3 2025-05-07T20:26:25.7525274Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:25.7525584Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:25.7525868Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.7526199Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:25.7526526Z #define __SSE2_MATH__ 1 2025-05-07T20:26:25.7526769Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:25.7527078Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.7527374Z #define _IO_STDIO_H 2025-05-07T20:26:25.7527615Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:25.7527907Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:25.7528220Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:25.7528514Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:25.7528821Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:25.7529079Z #define __amd64 1 2025-05-07T20:26:25.7529307Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:25.7529573Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:25.7529844Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:25.7530130Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:25.7530438Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:25.7530699Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:25.7530994Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:25.7531251Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:25.7531493Z #define __bounded 2025-05-07T20:26:25.7531716Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:25.7532001Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:25.7532276Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:25.7532536Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:25.7532802Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:25.7533113Z #define __W_STOPCODE(sig) ((sig) << 8 | 0x7f) 2025-05-07T20:26:25.7533523Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:25.7533925Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:25.7534275Z 
#define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:25.7534613Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:25.7534950Z #define STA_PLL 0x0001 2025-05-07T20:26:25.7535197Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:25.7535458Z #define __GNUG__ 11 2025-05-07T20:26:25.7535687Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:25.7535959Z #define _T_WCHAR 2025-05-07T20:26:25.7536199Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:25.7536493Z #define __specialization_static 2025-05-07T20:26:25.7536793Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:25.7537107Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:25.7537371Z #define cudaArraySparse 0x40 2025-05-07T20:26:25.7537632Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:25.7537884Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:25.7538172Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:25.7538469Z #define _WCHAR_T 2025-05-07T20:26:25.7538704Z #define __cudaCDP2Free 2025-05-07T20:26:25.7539338Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:25.7540498Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:25.7540960Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:25.7541453Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:25.7541736Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:25.7541996Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:25.7542334Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:25.7542685Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:25.7542923Z #define __NO_CTYPE 1 2025-05-07T20:26:25.7543156Z #define __stub_bdflush 2025-05-07T20:26:25.7543519Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:25.7544086Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:25.7544403Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:25.7544678Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:25.7544959Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:25.7545261Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:25.7545563Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:25.7545907Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:25.7546248Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:25.7546538Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:25.7546926Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:25.7547383Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:25.7547733Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:25.7548021Z #define _IO_STDIO 040000 2025-05-07T20:26:25.7548345Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:25.7548743Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:25.7549071Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:25.7549363Z #define _PTRDIFF_T 2025-05-07T20:26:25.7549583Z #define _MOVE_H 1 2025-05-07T20:26:25.7549820Z #define __cpp_hex_float 201603L 2025-05-07T20:26:25.7550084Z #define ADJ_TAI 0x0080 2025-05-07T20:26:25.7550310Z #define __ptrvalue 2025-05-07T20:26:25.7550542Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:25.7550801Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:26:25.7551084Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:25.7551396Z #define 
MATH_ERREXCEPT 2 2025-05-07T20:26:25.7551653Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:25.7551937Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:25.7552340Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:25.7552719Z #define __USE_GNU 1 2025-05-07T20:26:25.7552947Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:25.7553227Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:25.7553504Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:25.7554040Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:25.7554426Z #define WEXITED 4 2025-05-07T20:26:25.7554645Z #define _IO_NO_READS 4 2025-05-07T20:26:25.7554949Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:25.7555295Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:25.7555576Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:25.7555882Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:25.7556199Z #define __uid_t_defined 2025-05-07T20:26:25.7556454Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:25.7556744Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:25.7557016Z #define WNOHANG 1 2025-05-07T20:26:25.7557264Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:25.7557574Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:25.7557844Z #define cudaEventDefault 0x00 2025-05-07T20:26:25.7558153Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:25.7558483Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:25.7558723Z #define __x86_64 1 2025-05-07T20:26:25.7558954Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:25.7559349Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:25.7559826Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:25.7560317Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:25.7560753Z #define __PTRDIFF_T 2025-05-07T20:26:25.7561079Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:25.7561450Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:25.7561729Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.7562022Z #define _Mlong_double_ long double 2025-05-07T20:26:25.7562306Z #define __cpp_lambdas 200907L 2025-05-07T20:26:25.7562562Z #define _IO_DEC 020 2025-05-07T20:26:25.7562795Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:25.7563161Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:25.7563457Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:25.7563744Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:25.7564009Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:25.7564304Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:25.7564632Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:25.7564908Z #define _ANSI_STDDEF_H 2025-05-07T20:26:25.7565173Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:25.7565489Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:25.7565857Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:25.7566236Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:25.7566522Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:25.7566822Z #define __cpp_template_auto 201606L 2025-05-07T20:26:25.7567182Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:25.7567550Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:25.7567828Z #define 
__key_t_defined 2025-05-07T20:26:25.7568083Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:25.7568450Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:25.7568920Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:25.7569288Z #define __GNUC_VA_LIST 2025-05-07T20:26:25.7569619Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:25.7570010Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:25.7570277Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:25.7570562Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:25.7570859Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:25.7571111Z #define __WCOREFLAG 0x80 2025-05-07T20:26:25.7571368Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:25.7571671Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:25.7571957Z #define __LP64__ 1 2025-05-07T20:26:25.7572206Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:25.7572621Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:25.7572910Z #define _IO_off64_t __off64_t 2025-05-07T20:26:25.7573174Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:25.7573433Z #define __time_t_defined 1 2025-05-07T20:26:25.7573688Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:25.7574038Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:25.7574401Z #define __USE_UNIX98 1 2025-05-07T20:26:25.7574646Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:25.7574923Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:25.7575194Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:25.7575491Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:25.7575804Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:25.7576068Z #define SEEK_CUR 1 2025-05-07T20:26:25.7576295Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:25.7576570Z #define _ASSERT_H 1 2025-05-07T20:26:25.7577139Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:25.7577762Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:25.7578042Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:25.7578300Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:25.7578567Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:25.7578842Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:25.7579217Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:25.7579632Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:25.7580280Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:25.7580925Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:25.7581285Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:25.7581642Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:25.7582134Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:25.7582417Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:25.7582702Z #define cudaArrayDefault 0x00 2025-05-07T20:26:25.7582990Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:25.7583287Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:25.7583571Z #define TLOSS 5 2025-05-07T20:26:25.7583785Z #define __ssize_t_defined 2025-05-07T20:26:25.7584041Z #define __CUDACC_VER_BUILD__ 85 2025-05-07T20:26:25.7584317Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:25.7584607Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:25.7584903Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:25.7585267Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:25.7585651Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:25.7585939Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:25.7586229Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:25.7586539Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:25.7586840Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:25.7587132Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:25.7587387Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:25.7587724Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:25.7588086Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:25.7588329Z #define __cdecl 2025-05-07T20:26:25.7588567Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:25.7588902Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:25.7589233Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:25.7589484Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:25.7589766Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:25.7590067Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:25.7590331Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:25.7590643Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:25.7590976Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:25.7591382Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:25.7591920Z #define ADJ_NANO 0x2000 2025-05-07T20:26:25.7592229Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:25.7592589Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:25.7592875Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:25.7593139Z #define __FLT_DIG__ 6 2025-05-07T20:26:25.7593492Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:25.7593883Z #define __NO_INLINE__ 1 2025-05-07T20:26:25.7594187Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:25.7594541Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:25.7594797Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:25.7595065Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:25.7595359Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:25.7595627Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:25.7595929Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:25.7596224Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:25.7603674Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 
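Worth a gloss: "#define __isleap(year) ..." a bit further up is glibc's entire Gregorian leap-year rule in one expression: divisible by 4, except century years, unless divisible by 400. A quick check; my_isleap is a local copy of that definition:

/* isleap_demo.cu -- exercises the __isleap rule from the dump above */
#include <cstdio>

#define my_isleap(year) \
  ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0))

int main(void) {
  printf("%d %d %d %d\n",
         my_isleap(2024),   /* 1: divisible by 4, not a century */
         my_isleap(1900),   /* 0: century not divisible by 400  */
         my_isleap(2000),   /* 1: century divisible by 400      */
         my_isleap(2023));  /* 0: not divisible by 4            */
  return 0;
}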
2025-05-07T20:26:25.7604099Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:25.7604452Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:25.7604803Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:25.7605047Z #define MAX_CANON 255 2025-05-07T20:26:25.7605278Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:25.7605535Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:25.7605807Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:25.7606094Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:25.7606408Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:25.7606711Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:25.7606993Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:25.7607317Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:25.7607629Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:25.7607996Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:25.7608304Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:25.7608601Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:25.7608876Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:25.7609193Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:25.7609490Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:25.7609754Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:25.7610007Z #define _SYS_TYPES_H 1 2025-05-07T20:26:25.7610250Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:25.7610514Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:25.7610760Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:25.7610994Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:25.7611267Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:25.7611555Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:25.7611808Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:25.7612098Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:25.7612369Z #define FP_SUBNORMAL 3 2025-05-07T20:26:25.7612621Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:25.7612909Z #define _INITIALIZER_LIST 2025-05-07T20:26:25.7613159Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:25.7613401Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:25.7613676Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:25.7613965Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:25.7614219Z #define _IO_file_flags _flags 2025-05-07T20:26:25.7614475Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:25.7614724Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:25.7614997Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:25.7615271Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:25.7615538Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:25.7615911Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:25.7616304Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:25.7616612Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:25.7616882Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:25.7617136Z #define _BSD_SOURCE 1 2025-05-07T20:26:25.7617456Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:25.7618297Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:26:25.7619135Z #define __catch(X) catch(X) 2025-05-07T20:26:25.7619397Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:25.7619688Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:25.7619958Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:25.7620215Z #define __STRING(x) #x 2025-05-07T20:26:25.7620458Z #define
__GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:25.7620728Z #define _T_PTRDIFF_ 2025-05-07T20:26:25.7620976Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:25.7621357Z #define cudaEventWaitExternal 0x01 2025-05-07T20:26:25.7621634Z #define __unbounded 2025-05-07T20:26:25.7621873Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.7622168Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:25.7622457Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.7622753Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:25.7623035Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:25.7623332Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:25.7623656Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:25.7623965Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:25.7624246Z #define __managed__ __location__(managed) 2025-05-07T20:26:25.7624542Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:25.7624942Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:25.7625364Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:25.7625624Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:25.7625998Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:25.7626394Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:25.7626729Z #define _SYS_SIZE_T_H 2025-05-07T20:26:25.7627024Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:25.7627359Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:25.7627634Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:25.7627922Z #define _CRTIMP 2025-05-07T20:26:25.7628148Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:25.7628454Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:25.7628776Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:25.7629134Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:25.7629550Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:25.7629870Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:25.7630148Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:25.7630434Z #define __SIZE_T__ 2025-05-07T20:26:25.7630652Z #define __stub_gtty 2025-05-07T20:26:25.7630875Z #define __pid_t_defined 2025-05-07T20:26:25.7631131Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:25.7631444Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.7631761Z #define __glibcxx_function_requires(...) 
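The __W* macros scattered through this dump (__WTERMSIG(status) ((status) & 0x7f), __W_STOPCODE, WEXITSTATUS, and friends, some a little further down) encode the classic wait-status layout: the low 7 bits hold the terminating signal, zero there means a normal exit, and the exit code then sits in bits 8-15. A small decoding sketch using the public <sys/wait.h> wrappers built on those macros:

/* waitstatus_demo.cu -- decodes a status via the W* wrappers over the __W* macros above */
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  pid_t pid = fork();
  if (pid == 0)
    _exit(7);                     /* child: normal exit, code 7 */
  int status = 0;
  waitpid(pid, &status, 0);
  if (WIFEXITED(status))          /* per the dump: (status & 0x7f) == 0 */
    printf("exited, code %d\n", WEXITSTATUS(status));   /* bits 8..15 -> 7 */
  else if (WIFSIGNALED(status))
    printf("killed by signal %d\n", WTERMSIG(status));  /* status & 0x7f   */
  return 0;
}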
2025-05-07T20:26:25.7632052Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:25.7632297Z #define __need_clockid_t 2025-05-07T20:26:25.7632535Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:25.7632787Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:25.7633106Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:25.7633423Z #define _IO_HEX 0100 2025-05-07T20:26:25.7633682Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:25.7634016Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:25.7634324Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:25.7634596Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:25.7635009Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:25.7635449Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:25.7635755Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:25.7636062Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:25.7636257Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:25.7636364Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:25.7636448Z #define __stub_sstk 2025-05-07T20:26:25.7636543Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:25.7636706Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:25.7636787Z #define __wur 2025-05-07T20:26:25.7636909Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:25.7636998Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:25.7637081Z #define _IO_OCT 040 2025-05-07T20:26:25.7637180Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:25.7637270Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:25.7637361Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:25.7637493Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:25.7637585Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:25.7637689Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:25.7637945Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:25.7638091Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:25.7638225Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:25.7638349Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:25.7638440Z #define __off64_t_defined 2025-05-07T20:26:25.7638541Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:25.7638628Z #define __FLT128_DIG__ 33 2025-05-07T20:26:25.7638734Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:25.7638832Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:25.7638915Z #define __INT32_C(c) c 2025-05-07T20:26:25.7639013Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:25.7639116Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:25.7639213Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:25.7639303Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:25.7639395Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:25.7639491Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:25.7639629Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:25.7639822Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:25.7639920Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:25.7640024Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:25.7640345Z #define __have_pthread_attr_t 1 2025-05-07T20:26:25.7640451Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:25.7640681Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:25.7640791Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:25.7640893Z #define __cudaCDP2EventRecord 2025-05-07T20:26:25.7640992Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:25.7641078Z #define 
htole32(x) (x) 2025-05-07T20:26:25.7641329Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:25.7641463Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:25.7641564Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:25.7641726Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:25.7641872Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:25.7642003Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:25.7642145Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:25.7642237Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:25.7642338Z #define cudaArrayLayered 0x01 2025-05-07T20:26:25.7642511Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:25.7642621Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:25.7642716Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:25.7642822Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:25.7642903Z #define unix 1 2025-05-07T20:26:25.7642999Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:25.7643092Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:25.7643185Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:25.7643305Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:25.7643392Z #define __USE_POSIX 1 2025-05-07T20:26:25.7643484Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:25.7643625Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:25.7643870Z #define __THROWNL throw () 2025-05-07T20:26:25.7643963Z #define __cpp_rtti 199711L 2025-05-07T20:26:25.7644075Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:25.7644165Z #define __PMT(args) args 2025-05-07T20:26:25.7644279Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:25.7644435Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:25.7644551Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:25.7644647Z #define _SIZE_T_DECLARED 2025-05-07T20:26:25.7644745Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:25.7644835Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:25.7645229Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:25.7645327Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:25.7645421Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:25.7645531Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:25.7645678Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:25.7645767Z #define _WCHAR_T_H 2025-05-07T20:26:25.7645860Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:25.7645950Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:25.7646042Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:25.7646141Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:25.7646237Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:25.7646332Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:25.7646441Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:25.7646520Z #define __ELF__ 1 2025-05-07T20:26:25.7646628Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:25.7646729Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:25.7646815Z #define STA_INS 0x0010 2025-05-07T20:26:25.7646921Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:25.7647097Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:25.7647190Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:25.7647433Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:25.7647556Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
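Note "#define __CUDA_ARCH__ 520" in this chunk: together with __CUDACC_VER_MAJOR__ 12 / __CUDACC_VER_MINOR__ 6 / __CUDACC_VER_BUILD__ 85 elsewhere in the dump, it shows this macro listing came from nvcc 12.6.85 during a device-side compilation pass targeting compute capability 5.2. A minimal sketch of the usual way code branches on that macro; the file name and -arch flag are illustrative, not taken from this job:

/* arch_probe.cu -- illustrative; e.g. nvcc -arch=sm_52 arch_probe.cu */
#include <cstdio>
#include <cuda_runtime.h>

__host__ __device__ int pass_arch(void) {
#ifdef __CUDA_ARCH__
  return __CUDA_ARCH__;   /* device pass: 520 under -arch=sm_52 */
#else
  return 0;               /* host pass: the macro is undefined  */
#endif
}

__global__ void probe(int *out) { *out = pass_arch(); }

int main(void) {
  int *d = nullptr, h = -1;
  cudaMalloc(&d, sizeof(int));
  probe<<<1, 1>>>(d);
  cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
  printf("device: %d, host: %d\n", h, pass_arch());  /* device: 520, host: 0 */
  cudaFree(d);
  return 0;
}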
2025-05-07T20:26:25.7647673Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:25.7647772Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:25.7647876Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:25.7647979Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:25.7648136Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:25.7648295Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:25.7648399Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:25.7648760Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:25.7648946Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:25.7649085Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:25.7649211Z #define __FLT_RADIX__ 2 2025-05-07T20:26:25.7649360Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:25.7649607Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:25.7649758Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:25.7649864Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:25.7649968Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:25.7650067Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:25.7650168Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:25.7650273Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:25.7650358Z #define WORD_BIT 32 2025-05-07T20:26:25.7650446Z #define _IO_USER_BUF 1 2025-05-07T20:26:25.7650538Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:25.7650641Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:25.7650758Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:25.7650861Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:25.7650968Z #define __long_double_t long double 2025-05-07T20:26:25.7651061Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:25.7651153Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:25.7651559Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:25.7651748Z #define __k8 1 2025-05-07T20:26:25.7651945Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:25.7652119Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:25.7652235Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:25.7652342Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:25.7652441Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:25.7652544Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:25.7652643Z #define __blksize_t_defined 2025-05-07T20:26:25.7652743Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:25.7652842Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:25.7652954Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:25.7653045Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:25.7653151Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:25.7653244Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:25.7653342Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:25.7653609Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:25.7653951Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:25.7654058Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:25.7654153Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:25.7654235Z #define SEEK_SET 0 2025-05-07T20:26:25.7654333Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:25.7654428Z #define 
2025-05-07T20:26:25.7654620Z [... dump of predefined preprocessor macros from the CUDA/host toolchain check elided: several thousand #define lines, including __CUDACC__ 1, __NVCC__ 1, __CUDA_ARCH_LIST__ 520, and CUDART_VERSION 12060 ...]
2025-05-07T20:26:25.7921848Z + conda run -n build_binary nvcc --version
2025-05-07T20:26:27.6788527Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:26:27.6788921Z Copyright (c) 2005-2024 NVIDIA Corporation
2025-05-07T20:26:27.6789237Z Built on Tue_Oct_29_23:50:19_PDT_2024
2025-05-07T20:26:27.6789560Z Cuda compilation tools, release 12.6, V12.6.85
2025-05-07T20:26:27.6789901Z Build cuda_12.6.r12.6/compiler.35059454_0
2025-05-07T20:26:27.7462990Z /usr/bin/nvidia-smi
2025-05-07T20:26:27.7468115Z + nvidia-smi
2025-05-07T20:26:27.7642953Z Wed May  7 20:26:27 2025
2025-05-07T20:26:27.7643850Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:27.7644858Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:26:27.7645865Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:27.7649773Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:26:27.7650842Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:26:27.7651700Z |                                         |                        |               MIG M. |
2025-05-07T20:26:27.7652378Z |=========================================+========================+======================|
2025-05-07T20:26:27.7814181Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:26:27.7814742Z |  0%   28C    P8             16W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:26:27.7815139Z |                                         |                        |                  N/A |
2025-05-07T20:26:27.7815552Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:27.7818111Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:27.7818722Z | Processes:                                                                              |
2025-05-07T20:26:27.7819275Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:26:27.7819699Z |        ID   ID                                                               Usage      |
2025-05-07T20:26:27.7820056Z |=========================================================================================|
2025-05-07T20:26:27.7823264Z |  No running processes found                                                             |
2025-05-07T20:26:27.7823839Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.0372266Z [INSTALL] Successfully installed CUDA 12.6.3
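Note that the "CUDA Version: 12.8" in the nvidia-smi banner is the highest CUDA runtime the 570.133.07 driver supports, not what was installed; it only needs to be greater than or equal to the 12.6 toolkit that nvcc reports. A minimal sketch of how a job could assert that the installed toolkit matches the requested release (a hypothetical check written for illustration, not the actual setup_env.bash logic):

# Hypothetical assertion that the installed CUDA toolkit matches the requested release.
# Parses "Cuda compilation tools, release 12.6, V12.6.85" out of `nvcc --version`.
expected="12.6"
actual="$(conda run -n build_binary nvcc --version | sed -n 's/.*release \([0-9.]*\),.*/\1/p')"
if [ "${actual}" != "${expected}" ]; then
  echo "[CHECK] Expected CUDA toolkit ${expected} but nvcc reports ${actual}" >&2
  exit 1
fi
echo "[CHECK] nvcc reports CUDA toolkit ${actual}"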
2025-05-07T20:26:28.0421825Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:28.0422392Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:28.0435130Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:28.0435486Z env:
2025-05-07T20:26:28.0435721Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:28.0436022Z   BUILD_ENV: build_binary
2025-05-07T20:26:28.0436272Z   BUILD_TARGET: genai
2025-05-07T20:26:28.0436507Z   BUILD_VARIANT: cuda
2025-05-07T20:26:28.0436738Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:26:28.0437000Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:28.0437307Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:28.0437644Z ##[endgroup]
2025-05-07T20:26:28.3846382Z ################################################################################
2025-05-07T20:26:28.3846766Z # Install PyTorch (PIP)
2025-05-07T20:26:28.3847042Z #
2025-05-07T20:26:28.3861428Z # [2025-05-07T20:26:28.385Z] + install_pytorch_pip build_binary nightly cuda/12.6.3
2025-05-07T20:26:28.3861965Z ################################################################################
2025-05-07T20:26:28.3890243Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:29.3945397Z Channels:
2025-05-07T20:26:29.3945752Z  - conda-forge
2025-05-07T20:26:29.3945990Z Platform: linux-64
2025-05-07T20:26:32.7775816Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:33.4969737Z Solving environment: done
2025-05-07T20:26:33.7126496Z ## Package Plan ##
2025-05-07T20:26:33.7126927Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:33.7127347Z   added / updated specs:
2025-05-07T20:26:33.7128027Z     - numpy
2025-05-07T20:26:33.7128311Z The following packages will be downloaded:
2025-05-07T20:26:33.7128663Z     package                    |            build
2025-05-07T20:26:33.7128993Z     ---------------------------|-----------------
2025-05-07T20:26:33.7129383Z     libblas-3.9.0              | 31_h59b9bed_openblas     16 KB  conda-forge
2025-05-07T20:26:33.7129851Z     libcblas-3.9.0             | 31_he106b2a_openblas     16 KB  conda-forge
2025-05-07T20:26:33.7130310Z     libgfortran-15.1.0         | h69a702a_2               34 KB  conda-forge
2025-05-07T20:26:33.7130764Z     libgfortran5-15.1.0        | hcea5267_2              1.5 MB  conda-forge
2025-05-07T20:26:33.7131227Z     liblapack-3.9.0            | 31_h7ac8fdf_openblas     16 KB  conda-forge
2025-05-07T20:26:33.7131708Z     libopenblas-0.3.29         | pthreads_h94d23a6_0     5.6 MB  conda-forge
2025-05-07T20:26:33.7132169Z     numpy-2.0.2                | py39h9cb892a_1          7.6 MB  conda-forge
2025-05-07T20:26:33.7132564Z     ------------------------------------------------------------
2025-05-07T20:26:33.7132909Z                                            Total:       14.8 MB
2025-05-07T20:26:33.7133256Z The following NEW packages will be INSTALLED:
2025-05-07T20:26:33.7133717Z   libblas       conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:26:33.7134219Z   libcblas      conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:26:33.7134736Z   libgfortran   conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:26:33.7135298Z   libgfortran5  conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:26:33.7135815Z   liblapack     conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:26:33.7136368Z   libopenblas   conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:26:33.7137128Z   numpy         conda-forge/linux-64::numpy-2.0.2-py39h9cb892a_1
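The [EXEC] [ATTEMPT 0/3] prefix above is emitted by a bounded retry wrapper that the prelude script puts around network-bound commands, re-running them on failure up to a fixed attempt budget. A minimal sketch of such a wrapper, using a hypothetical run_with_retries helper (the real implementation lives in .github/scripts/setup_env.bash and may differ):

# Hypothetical sketch of the retry wrapper suggested by the "[EXEC] [ATTEMPT i/N]" lines.
run_with_retries () {
  local max_attempts=3
  local attempt=0
  while true; do
    echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
    if "$@"; then
      return 0                        # command succeeded; stop retrying
    fi
    attempt=$((attempt + 1))
    if [ "${attempt}" -gt "${max_attempts}" ]; then
      echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
      return 1                        # retry budget exhausted
    fi
    sleep $((attempt * 10))           # simple linear backoff between attempts
  done
}

# Example mirroring the log line above:
#   run_with_retries conda install -n build_binary -c conda-forge --override-channels -y numpy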
2025-05-07T20:26:33.7137570Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:33.8589309Z libgfortran-15.1.0  | 34 KB  | ########## | 100%
2025-05-07T20:26:33.8666987Z libblas-3.9.0       | 16 KB  | ########## | 100%
2025-05-07T20:26:33.9861705Z libcblas-3.9.0      | 16 KB  | ########## | 100%
2025-05-07T20:26:34.0437690Z liblapack-3.9.0     | 16 KB  | ########## | 100%
2025-05-07T20:26:34.1251581Z libgfortran5-15.1.0 | 1.5 MB | ########## | 100%
2025-05-07T20:26:34.1771233Z libopenblas-0.3.29  | 5.6 MB | ########## | 100%
2025-05-07T20:26:34.1968148Z numpy-2.0.2         | 7.6 MB | ########## | 100%
2025-05-07T20:26:34.6167025Z done
2025-05-07T20:26:34.7179060Z Preparing transaction: done
2025-05-07T20:26:34.9183392Z Verifying transaction: done
2025-05-07T20:26:35.0192857Z Executing transaction: done
2025-05-07T20:26:35.1976430Z ################################################################################
2025-05-07T20:26:35.1976855Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:35.1977163Z #
2025-05-07T20:26:35.1992773Z # [2025-05-07T20:26:35.198Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:26:35.1993262Z ################################################################################
2025-05-07T20:26:35.2008354Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:35.3023529Z [CHECK] Network does not appear to be blocked.
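The "Prepare PIP Arguments" step that follows turns the (nightly, cuda/12.6.3) pair into the cu126 variant tag and the https://download.pytorch.org/whl/nightly/cu126/ index URL reported in the [INSTALL] lines below. A rough sketch of that mapping (assumed shape only; the actual parsing is done by __prepare_pip_arguments in setup_env.bash):

# Hypothetical reconstruction of the variant/index derivation shown in the log.
channel="nightly"               # release | test | nightly
variant_spec="cuda/12.6.3"      # <variant>/<full version>

version="${variant_spec#*/}"                                      # 12.6.3
variant_tag="cu$(echo "${version}" | cut -d. -f1-2 | tr -d '.')"  # cu126
pip_index="https://download.pytorch.org/whl/${channel}/${variant_tag}/"

echo "[INSTALL] Extracted package variant: ${variant_tag}"
echo "[INSTALL] Extracted the full PIP channel: ${pip_index}"
# conda run -n build_binary pip install --pre torch --index-url "${pip_index}"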
2025-05-07T20:26:35.3023973Z ################################################################################ 2025-05-07T20:26:35.3024455Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:26:35.3024852Z # 2025-05-07T20:26:35.3041257Z # [2025-05-07T20:26:35.303Z] + __prepare_pip_arguments torch nightly cuda/12.6.3 2025-05-07T20:26:35.3042083Z ################################################################################ 2025-05-07T20:26:35.3042320Z 2025-05-07T20:26:35.3062957Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:26:35.3089578Z [INSTALL] Extracted package variant: cu126 2025-05-07T20:26:35.3107023Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:26:35.3107565Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:26:35.3116292Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:26:35.3124899Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ... 2025-05-07T20:26:35.3146599Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:55.1880009Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:55.1880526Z Collecting torch 2025-05-07T20:27:55.1881174Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp39-cp39-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:27:55.1881901Z Collecting filelock (from torch) 2025-05-07T20:27:55.1882429Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:27:55.1883385Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from torch) (4.13.2) 2025-05-07T20:27:55.1884094Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:27:55.1884650Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:27:55.1885521Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 41.6 MB/s eta 0:00:00 2025-05-07T20:27:55.1885884Z Collecting networkx (from torch) 2025-05-07T20:27:55.1886763Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.2.1-py3-none-any.whl (1.6 MB) 2025-05-07T20:27:55.1887443Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 16.7 MB/s eta 0:00:00 2025-05-07T20:27:55.1887783Z Collecting jinja2 (from torch) 2025-05-07T20:27:55.1888271Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:27:55.1888781Z Collecting fsspec (from torch) 2025-05-07T20:27:55.1889274Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:27:55.1889848Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch) 2025-05-07T20:27:55.1890567Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-05-07T20:27:55.1891352Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 56.7 MB/s eta 0:00:00 2025-05-07T20:27:55.1891765Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch) 2025-05-07T20:27:55.1892524Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB) 2025-05-07T20:27:55.1893310Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 9.8 MB/s 
eta 0:00:00 2025-05-07T20:27:55.1893710Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch) 2025-05-07T20:27:55.1894418Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB) 2025-05-07T20:27:55.1895200Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 36.9 MB/s eta 0:00:00 2025-05-07T20:27:55.1895585Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch) 2025-05-07T20:27:55.1896260Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-05-07T20:27:55.1897032Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 36.3 MB/s eta 0:00:00 2025-05-07T20:27:55.1906178Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch) 2025-05-07T20:27:55.1907266Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-05-07T20:27:55.1908182Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 51.7 MB/s eta 0:00:00 2025-05-07T20:27:55.1908577Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch) 2025-05-07T20:27:55.1909263Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB) 2025-05-07T20:27:55.1910036Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 146.4 MB/s eta 0:00:00 2025-05-07T20:27:55.1910423Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch) 2025-05-07T20:27:55.1911118Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB) 2025-05-07T20:27:55.1911890Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 196.0 MB/s eta 0:00:00 2025-05-07T20:27:55.1912316Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch) 2025-05-07T20:27:55.1913015Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB) 2025-05-07T20:27:55.1913785Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 150.2 MB/s eta 0:00:00 2025-05-07T20:27:55.1914175Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch) 2025-05-07T20:27:55.1914885Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB) 2025-05-07T20:27:55.1915649Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 129.1 MB/s eta 0:00:00 2025-05-07T20:27:55.1916046Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:27:55.1916741Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:27:55.1917533Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 163.1 MB/s eta 0:00:00 2025-05-07T20:27:55.1918639Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:27:55.1919404Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:27:55.1920177Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch) 2025-05-07T20:27:55.1920837Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB) 2025-05-07T20:27:55.1921499Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch) 2025-05-07T20:27:55.1922275Z Downloading 
https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-05-07T20:27:55.1923128Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 183.7 MB/s eta 0:00:00 2025-05-07T20:27:55.1923512Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch) 2025-05-07T20:27:55.1924318Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:27:55.1925127Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:27:55.1925947Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:27:55.1927191Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1) 2025-05-07T20:27:55.1928036Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:27:55.1928593Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:27:55.1929227Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 48.0 MB/s eta 0:00:00 2025-05-07T20:27:55.1929603Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:27:55.1930391Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB) 2025-05-07T20:27:55.1931418Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp39-cp39-manylinux_2_28_x86_64.whl (825.5 MB) 2025-05-07T20:27:55.1932211Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.5/825.5 MB 36.1 MB/s eta 0:00:00 2025-05-07T20:27:55.1932981Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-05-07T20:27:55.1933821Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 11.5 MB/s eta 0:00:00 2025-05-07T20:27:55.1934567Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:27:55.1935392Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 102.9 MB/s eta 0:00:00 2025-05-07T20:27:55.1936261Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.4 MB) 2025-05-07T20:27:55.1937119Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.4/153.4 MB 134.2 MB/s eta 0:00:00 2025-05-07T20:27:55.1938812Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:27:55.1940557Z 2025-05-07T20:27:55.1942558Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.2.1 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 
nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:27:55.1944590Z 2025-05-07T20:27:57.4187780Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:27:57.4190207Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:28:00.8253501Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:04.2763425Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:04.2763897Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:07.6175631Z True 2025-05-07T20:28:07.6175924Z True 2025-05-07T20:28:07.6176040Z 2025-05-07T20:28:07.6837690Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:07.6874449Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:07.6875077Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:07.6886770Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:07.6887124Z env: 2025-05-07T20:28:07.6887351Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:07.6887651Z BUILD_ENV: build_binary 2025-05-07T20:28:07.6887903Z BUILD_TARGET: genai 2025-05-07T20:28:07.6888132Z BUILD_VARIANT: cuda 2025-05-07T20:28:07.6888369Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:07.6888625Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:07.6888929Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:07.6889267Z ##[endgroup] 2025-05-07T20:28:08.0268310Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:08.0270787Z ################################################################################ 2025-05-07T20:28:08.0271320Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:08.0271701Z # 2025-05-07T20:28:08.0288090Z # [2025-05-07T20:28:08.028Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:08.0288522Z ################################################################################ 2025-05-07T20:28:08.0288741Z 2025-05-07T20:28:08.0305196Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:08.1244672Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:08.1255473Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:08.1256099Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:08.1256511Z 2025-05-07T20:28:08.2168446Z 2025-05-07T20:28:08.2169003Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:08.2190219Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:14.2924858Z Collecting environment information... 
2025-05-07T20:28:14.2925443Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:14.2925813Z Is debug build: False 2025-05-07T20:28:14.2926066Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:14.2926354Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:14.2926542Z 2025-05-07T20:28:14.2926674Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:14.2927022Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:14.2927349Z Clang version: Could not collect 2025-05-07T20:28:14.2927634Z CMake version: Could not collect 2025-05-07T20:28:14.2927902Z Libc version: glibc-2.34 2025-05-07T20:28:14.2928105Z 2025-05-07T20:28:14.2928532Z Python version: 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:14.2929304Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:14.2929916Z Is CUDA available: True 2025-05-07T20:28:14.2930268Z CUDA runtime version: 12.6.85 2025-05-07T20:28:14.2930646Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:14.2931019Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:14.2931354Z Nvidia driver version: 570.133.07 2025-05-07T20:28:14.2931643Z cuDNN version: Could not collect 2025-05-07T20:28:14.2931921Z HIP runtime version: N/A 2025-05-07T20:28:14.2932171Z MIOpen runtime version: N/A 2025-05-07T20:28:14.2932433Z Is XNNPACK available: True 2025-05-07T20:28:14.2932597Z 2025-05-07T20:28:14.2932680Z CPU: 2025-05-07T20:28:14.2932896Z Architecture: x86_64 2025-05-07T20:28:14.2933241Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:14.2933640Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:14.2934041Z Byte Order: Little Endian 2025-05-07T20:28:14.2934359Z CPU(s): 16 2025-05-07T20:28:14.2934676Z On-line CPU(s) list: 0-15 2025-05-07T20:28:14.2935416Z Vendor ID: AuthenticAMD 2025-05-07T20:28:14.2935770Z Model name: AMD EPYC 7R32 2025-05-07T20:28:14.2936100Z CPU family: 23 2025-05-07T20:28:14.2936396Z Model: 49 2025-05-07T20:28:14.2936689Z Thread(s) per core: 2 2025-05-07T20:28:14.2936994Z Core(s) per socket: 8 2025-05-07T20:28:14.2937284Z Socket(s): 1 2025-05-07T20:28:14.2937569Z Stepping: 0 2025-05-07T20:28:14.2937890Z BogoMIPS: 5599.99 2025-05-07T20:28:14.2939987Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:14.2942488Z Hypervisor vendor: KVM 2025-05-07T20:28:14.2942807Z Virtualization type: full 2025-05-07T20:28:14.2943147Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:14.2943520Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:14.2943885Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:14.2944242Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:14.2944564Z NUMA node(s): 1 2025-05-07T20:28:14.2944862Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:14.2945206Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:14.2945788Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:14.2946155Z Vulnerability L1tf: Not affected 2025-05-07T20:28:14.2946512Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:14.2946865Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:14.2947231Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:14.2947603Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:14.2948154Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:14.2948733Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:14.2949284Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:14.2949973Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:14.2950824Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:14.2951511Z Vulnerability Srbds: Not affected 2025-05-07T20:28:14.2951879Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:14.2952112Z 2025-05-07T20:28:14.2952224Z Versions of relevant libraries: 2025-05-07T20:28:14.2952493Z [pip3] numpy==2.0.2 2025-05-07T20:28:14.2952744Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:14.2953060Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:14.2953376Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:14.2953699Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:14.2954021Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:14.2954312Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:14.2954613Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:14.2954920Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:14.2955236Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:14.2955714Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:14.2956027Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:14.2956326Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:14.2956653Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:14.2956979Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:14.2957288Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:14.2957665Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.2958158Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.2958676Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.2959203Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.2959741Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.2960280Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.2960781Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2961249Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:14.2961736Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:14.2962236Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:14.2962752Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2963215Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:14.2963680Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2964141Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2964621Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.2965198Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:14.2965663Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:28:14.2966135Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:14.2966599Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2967067Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:14.2967538Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2968012Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.2968491Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:14.2968978Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:14.2969478Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2969967Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:14.2970453Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.2970945Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:14.2971420Z [conda] numpy 2.0.2 py39h9cb892a_1 conda-forge 2025-05-07T20:28:14.2971880Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:14.2972388Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:14.2972899Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.2973409Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.2973903Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:14.2974483Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:14.2974971Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:14.2975460Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:14.2975965Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:14.2976475Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:14.2976973Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:14.2977458Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:14.2977949Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.2978436Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:14.2978904Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:14.2979193Z 2025-05-07T20:28:14.3722478Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:14.3723168Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:14.3735122Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:14.3735478Z env: 2025-05-07T20:28:14.3735700Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:14.3736008Z BUILD_ENV: build_binary 2025-05-07T20:28:14.3736255Z BUILD_TARGET: genai 2025-05-07T20:28:14.3736485Z BUILD_VARIANT: cuda 2025-05-07T20:28:14.3736719Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:14.3736980Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:14.3737289Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:14.3737623Z ##[endgroup] 2025-05-07T20:28:14.7143135Z ################################################################################ 2025-05-07T20:28:14.7143533Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:14.7144131Z # 2025-05-07T20:28:14.7158560Z # [2025-05-07T20:28:14.715Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:14.7158970Z ################################################################################ 2025-05-07T20:28:14.7159187Z 2025-05-07T20:28:14.7175682Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:14.8053867Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:14.8075300Z [BUILD] Running git submodules update ... 2025-05-07T20:28:14.8096404Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:14.8458123Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:14.8458599Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:14.8459048Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:14.8459460Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:14.8459872Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:14.8460344Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:14.8460748Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:14.8493737Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:14.9041012Z [BUILD] Installing other build dependencies ... 
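[NOTE] The submodule preparation above can be reproduced locally with plain git; a minimal sketch, assuming a checkout of pytorch/FBGEMM as the current directory (both commands mirror the [EXEC] lines in the log):
+ git submodule sync                               # re-point submodule URLs at the ones recorded in .gitmodules
+ git submodule update --init --recursive          # fetch asmjit, composable_kernel, cpuinfo, cutlass, googletest, hipify_torch, json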
2025-05-07T20:28:14.9062928Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:17.3041828Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:17.3154118Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:17.4176794Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:17.4215063Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:17.6792659Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:17.6836414Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:17.8034387Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:17.8070596Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:18.1743225Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:18.1782225Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:18.2405987Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:18.2410829Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:18.3284375Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:18.3319720Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:18.3810893Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 21)) (2.0.2) 2025-05-07T20:28:18.4488719Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:18.4542077Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:18.5919083Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:18.5956578Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:18.6981908Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:18.7070187Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:18.7611575Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:18.8319746Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:18.8369888Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:18.9394772Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:18.9431987Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:19.0602863Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:19.0701356Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:19.1873692Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:19.1908199Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:19.3002334Z Collecting pyproject_hooks (from build->-r requirements.txt (line 
14)) 2025-05-07T20:28:19.3039768Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:19.4593302Z Collecting importlib-metadata>=4.6 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:19.4632557Z Downloading importlib_metadata-8.7.0-py3-none-any.whl.metadata (4.8 kB) 2025-05-07T20:28:19.5751743Z Collecting tomli>=1.1.0 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:19.5790550Z Downloading tomli-2.2.1-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:19.6927798Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:19.6961973Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:19.8156981Z Collecting exceptiongroup>=1.0.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:19.8192477Z Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB) 2025-05-07T20:28:19.9209651Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:19.9243242Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:19.9861786Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:20.0374245Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:20.0409971Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:20.0913936Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:20.1447586Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:20.1481400Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:20.1981673Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:20.2790553Z Collecting zipp>=3.20 (from importlib-metadata>=4.6->build->-r requirements.txt (line 14)) 2025-05-07T20:28:20.2826435Z Downloading zipp-3.21.0-py3-none-any.whl.metadata (3.7 kB) 2025-05-07T20:28:20.3932802Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:20.3977228Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:20.4513682Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:20.5096863Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:20.5712964Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:21.1959716Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 44.6 MB/s eta 0:00:00 2025-05-07T20:28:21.1997463Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:21.2601291Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:21.3187170Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:21.3820421Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:21.4452316Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:21.5033758Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl 
(737 kB) 2025-05-07T20:28:21.5647754Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 737.4/737.4 kB 8.1 MB/s eta 0:00:00 2025-05-07T20:28:21.5713225Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:21.6383577Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:21.6950925Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:21.7552230Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:21.8150312Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:21.8753908Z Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB) 2025-05-07T20:28:21.9271620Z Downloading importlib_metadata-8.7.0-py3-none-any.whl (27 kB) 2025-05-07T20:28:21.9881617Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:22.0489604Z Downloading tomli-2.2.1-py3-none-any.whl (14 kB) 2025-05-07T20:28:22.1121050Z Downloading zipp-3.21.0-py3-none-any.whl (9.6 kB) 2025-05-07T20:28:22.1680094Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:22.2276334Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:22.2879275Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:22.3482164Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:22.5957291Z Installing collected packages: sortedcontainers, zipp, tomli, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, exceptiongroup, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, importlib-metadata, hypothesis, pyre-extensions, build 2025-05-07T20:28:24.9792631Z 2025-05-07T20:28:24.9868523Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 exceptiongroup-1.2.2 hypothesis-6.131.14 importlib-metadata-8.7.0 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 tomli-2.2.1 typing-inspect-0.9.0 zipp-3.21.0 2025-05-07T20:28:25.1765760Z ################################################################################ 2025-05-07T20:28:25.1766316Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:25.1766670Z # 2025-05-07T20:28:25.1783479Z # [2025-05-07T20:28:25.178Z] + install_triton_pip build_binary 2025-05-07T20:28:25.1784018Z ################################################################################ 2025-05-07T20:28:25.1784243Z 2025-05-07T20:28:25.1784483Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:25.1785375Z ################################################################################ 2025-05-07T20:28:25.1785748Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:25.1786081Z # 2025-05-07T20:28:25.1800433Z # [2025-05-07T20:28:25.179Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:25.1801096Z ################################################################################ 2025-05-07T20:28:25.1801321Z 2025-05-07T20:28:25.1816110Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:25.2715472Z [CHECK] Network does not appear to be blocked. 
2025-05-07T20:28:25.2716137Z ################################################################################ 2025-05-07T20:28:25.2716520Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:25.2716801Z # 2025-05-07T20:28:25.2733298Z # [2025-05-07T20:28:25.273Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:25.2734031Z ################################################################################ 2025-05-07T20:28:25.2734254Z 2025-05-07T20:28:25.2783994Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:25.2799903Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:25.2800582Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:25.2808856Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:25.2818139Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:25.2839296Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:33.1472539Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:33.1473892Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:33.1474623Z 2025-05-07T20:28:33.1474840Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:33.1475262Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:33.1476069Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:33.1477256Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.4 MB) 2025-05-07T20:28:33.1478343Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.4/166.4 MB 51.5 MB/s eta 0:00:00 2025-05-07T20:28:33.1478738Z Installing collected packages: pytorch-triton 2025-05-07T20:28:33.1479107Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:33.1479504Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:33.1479934Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:33.1480368Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:33.1480814Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:33.1481088Z 2025-05-07T20:28:35.3533060Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:35.3537154Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:37.5114612Z ################################################################################ 2025-05-07T20:28:37.5115091Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:37.5117128Z ################################################################################ 2025-05-07T20:28:37.5117364Z 2025-05-07T20:28:39.5580146Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:41.7264467Z [CHECK] Python (sub-)package 'skbuild' found ... 
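[NOTE] A minimal sketch of verifying the pinned pytorch-triton version by hand, equivalent to the VERSION check above (assumes the build_binary env from earlier steps; triton.__version__ is the standard version attribute of the triton Python package):
+ conda run -n build_binary python -c "import triton; print(triton.__version__)"  # expected to print 3.2.0 after the pin above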
2025-05-07T20:28:41.7268018Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:41.7301944Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:41.7302447Z . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:41.7314248Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:41.7314601Z env: 2025-05-07T20:28:41.7314831Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:41.7315146Z BUILD_ENV: build_binary 2025-05-07T20:28:41.7315429Z BUILD_TARGET: genai 2025-05-07T20:28:41.7315659Z BUILD_VARIANT: cuda 2025-05-07T20:28:41.7315895Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:41.7316147Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:41.7316452Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:41.7316792Z ##[endgroup] 2025-05-07T20:28:42.0698072Z ################################################################################ 2025-05-07T20:28:42.0698858Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:42.0699126Z # 2025-05-07T20:28:42.0713224Z # [2025-05-07T20:28:42.070Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0713870Z ################################################################################ 2025-05-07T20:28:42.0714092Z 2025-05-07T20:28:42.0714450Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0715141Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0715478Z 2025-05-07T20:28:42.0829962Z c3e6bfa6eadc59821953963216ffd62fb3371bf7 fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0832508Z 2025-05-07T20:28:42.0833248Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0833615Z 2025-05-07T20:28:42.0964817Z b3c4041bb027a8c4ddf5b1fb266e05c307983525c8b62d76707c6e5028cede02 fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0967028Z 2025-05-07T20:28:42.0967336Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.0967686Z 2025-05-07T20:28:42.1195681Z 139b82e5ceb5b9eb6e5607b06b0c5115 fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:42.1198306Z 2025-05-07T20:28:42.1207953Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:42.1229686Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:44.8592739Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:44.8594027Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.0.2) 2025-05-07T20:28:44.8595026Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:28:44.8595487Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:28:44.8595763Z 2025-05-07T20:28:51.6921489Z ################################################################################ 2025-05-07T20:28:51.6921937Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:28:51.6922321Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:51.6922771Z [CHECK] CUDA version reported by PyTorch is: 12.6 2025-05-07T20:28:51.6923100Z [CHECK] 2025-05-07T20:28:51.6923425Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:28:51.6923939Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:28:51.6924346Z ################################################################################ 2025-05-07T20:28:51.6924598Z 2025-05-07T20:28:51.6924726Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:28:55.5972180Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:28:59.5106693Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:03.4298087Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:03.4302830Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:15.1581628Z ################################################################################ 2025-05-07T20:29:15.1582089Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:15.1582447Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:15.1582804Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:15.1583161Z ################################################################################ 2025-05-07T20:29:15.1583394Z 2025-05-07T20:29:22.9743409Z ################################################################################ 2025-05-07T20:29:22.9744284Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:22.9745674Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:22.9747239Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:22.9747778Z ################################################################################ 2025-05-07T20:29:22.9748003Z 2025-05-07T20:29:22.9748171Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:29:26.8687053Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:29:30.7621895Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:29:34.7984756Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:29:38.6902257Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:29:38.6906294Z [INSTALL] Check for operator registrations ... 
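[NOTE] A minimal sketch of the kind of operator-registration check that follows, assuming the build_binary env (per the symbol listing above, importing fbgemm_gpu loads the native libraries that register the fbgemm ops under torch.ops):
+ conda run -n build_binary python -c "import torch, fbgemm_gpu; print(torch.ops.fbgemm.nccl_init)"  # raises if the op is not registered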
2025-05-07T20:29:42.5194713Z fbgemm.nccl_init 2025-05-07T20:29:42.5194958Z 2025-05-07T20:29:42.5821396Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:29:46.4194524Z fbgemm.gqa_attn_splitk 2025-05-07T20:29:46.4194747Z 2025-05-07T20:29:46.4823584Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:29:50.3115653Z fbgemm.rope_qkv_decoding 2025-05-07T20:29:50.3115888Z 2025-05-07T20:29:50.3756189Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:29:50.3756810Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:29:50.3794649Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:50.3795129Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:50.3808997Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:50.3809360Z env: 2025-05-07T20:29:50.3809592Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:50.3809892Z BUILD_ENV: build_binary 2025-05-07T20:29:50.3810145Z BUILD_TARGET: genai 2025-05-07T20:29:50.3810380Z BUILD_VARIANT: cuda 2025-05-07T20:29:50.3810615Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:50.3810879Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:50.3811185Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:50.3811528Z ##[endgroup] 2025-05-07T20:29:50.7184713Z ################################################################################ 2025-05-07T20:29:50.7185237Z # Test All FBGEMM-GPU Modules 2025-05-07T20:29:50.7185607Z # 2025-05-07T20:29:50.7200271Z # [2025-05-07T20:29:50.719Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:29:50.7200860Z ################################################################################ 2025-05-07T20:29:50.7201192Z 2025-05-07T20:29:58.5745512Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:29:58.5746473Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:29:58.5746913Z [TEST] Determined the test directories: 2025-05-07T20:29:58.5747241Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:29:58.5747556Z fbgemm_gpu/experimental/example/test 2025-05-07T20:29:58.5747858Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:29:58.5748055Z 2025-05-07T20:29:58.5756140Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:29:58.5763270Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:29:58.5763713Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:29:58.5764423Z 2025-05-07T20:29:59.0029992Z 2025-05-07T20:29:59.0030275Z [TEST] Installing PyTest ... 
2025-05-07T20:29:59.0053822Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:00.1057897Z Channels: 2025-05-07T20:30:00.1058189Z - conda-forge 2025-05-07T20:30:00.1058446Z Platform: linux-64 2025-05-07T20:30:03.3976690Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:04.5514619Z Solving environment: \ | / done 2025-05-07T20:30:04.7776336Z 2025-05-07T20:30:04.7776771Z ## Package Plan ## 2025-05-07T20:30:04.7777013Z 2025-05-07T20:30:04.7777302Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:04.7777764Z 2025-05-07T20:30:04.7777897Z added / updated specs: 2025-05-07T20:30:04.7778239Z - expecttest 2025-05-07T20:30:04.7778520Z - pytest 2025-05-07T20:30:04.7778688Z 2025-05-07T20:30:04.7778694Z 2025-05-07T20:30:04.7778853Z The following packages will be downloaded: 2025-05-07T20:30:04.7779155Z 2025-05-07T20:30:04.7779278Z package | build 2025-05-07T20:30:04.7779604Z ---------------------------|----------------- 2025-05-07T20:30:04.7779982Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:04.7780450Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:04.7780918Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:04.7781448Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:04.7781886Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:04.7782314Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:04.7782732Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:04.7783474Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:04.7783882Z ------------------------------------------------------------ 2025-05-07T20:30:04.7784230Z Total: 428 KB 2025-05-07T20:30:04.7784440Z 2025-05-07T20:30:04.7784574Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:04.7784795Z 2025-05-07T20:30:04.7784997Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:04.7785507Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:04.7786033Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:04.7786515Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:04.7786986Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:04.7787442Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:04.7787884Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:04.7788312Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:04.7788575Z 2025-05-07T20:30:04.7788580Z 2025-05-07T20:30:04.7788584Z 2025-05-07T20:30:04.7788731Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:30:05.0344008Z exceptiongroup-1.2.2 | 20 KB | ########## | 100%
2025-05-07T20:30:05.0456434Z colorama-0.4.6 | 26 KB | ########## | 100%
2025-05-07T20:30:05.0668396Z tomli-2.2.1 | 19 KB | ########## | 100%
2025-05-07T20:30:05.0859163Z pluggy-1.5.0 | 23 KB | ########## | 100%
2025-05-07T20:30:05.0874652Z expecttest-0.3.0 | 14 KB | ########## | 100%
2025-05-07T20:30:05.0976106Z packaging-25.0 | 61 KB | ########## | 100%
2025-05-07T20:30:05.0984141Z iniconfig-2.0.0 | 11 KB | ########## | 100%
2025-05-07T20:30:05.1236081Z pytest-8.3.5 | 254 KB | ########## | 100%
2025-05-07T20:30:05.1445423Z done
2025-05-07T20:30:05.2455902Z Preparing transaction: done
2025-05-07T20:30:05.3460568Z Verifying transaction: done
2025-05-07T20:30:07.2488020Z Executing transaction: done
2025-05-07T20:30:07.3848533Z [TEST] Checking imports ...
2025-05-07T20:30:11.2829217Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:11.2841691Z [TEST] Setting feature flags ...
2025-05-07T20:30:11.2842282Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:11.2842722Z 2025-05-07T20:30:11.7069701Z 2025-05-07T20:30:11.7070473Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:11.7071369Z ################################################################################ 2025-05-07T20:30:11.7071808Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:11.7072069Z # 2025-05-07T20:30:11.7092519Z # [2025-05-07T20:30:11.708Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:11.7093412Z ################################################################################ 2025-05-07T20:30:11.7093656Z 2025-05-07T20:30:11.7100280Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:11.7129079Z ./attention/gqa_test.py 2025-05-07T20:30:11.7129457Z ./coalesce/coalesce_test.py 2025-05-07T20:30:11.7129846Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:11.7130237Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:11.7130584Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:11.7130847Z ./moe/activation_test.py 2025-05-07T20:30:11.7131098Z ./moe/gather_scatter_test.py 2025-05-07T20:30:11.7131355Z ./moe/layers_test.py 2025-05-07T20:30:11.7131590Z ./moe/shuffling_test.py 2025-05-07T20:30:11.7131833Z ./quantize/quantize_test.py 2025-05-07T20:30:11.7132004Z 2025-05-07T20:30:11.7132122Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:11.7132342Z 2025-05-07T20:30:11.7150037Z ################################################################################ 2025-05-07T20:30:11.7164471Z # [2025-05-07T20:30:11.716Z] Run Python Test Suite: 2025-05-07T20:30:11.7164931Z # ./attention/gqa_test.py 2025-05-07T20:30:11.7165352Z ################################################################################ 2025-05-07T20:30:11.7188948Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:11.7189558Z 2025-05-07T20:30:14.2649184Z ============================= test session starts ============================== 2025-05-07T20:30:14.2649819Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:14.2650347Z cachedir: .pytest_cache 2025-05-07T20:30:14.2650928Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:14.2651915Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:14.2652367Z plugins: hypothesis-6.131.14 2025-05-07T20:30:15.7902948Z collecting ... 
collected 2 items
2025-05-07T20:30:51.8656150Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
2025-05-07T20:30:51.8658767Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
2025-05-07T20:30:51.8661063Z Trying example: test_gqa(int4_kv=True, num_groups=4, B=23, MAX_T=33, N_H_L=68)
2025-05-07T20:30:51.8663465Z Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=4, N_H_L=1)
2025-05-07T20:30:51.8665703Z Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=52, N_H_L=67)
2025-05-07T20:30:51.8668439Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=57, MAX_T=45, N_H_L=120)
2025-05-07T20:30:51.8670686Z Trying example: test_gqa(int4_kv=True, num_groups=4, B=52, MAX_T=42, N_H_L=53)
2025-05-07T20:30:51.8672918Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=77, MAX_T=95, N_H_L=53)
2025-05-07T20:30:51.8675202Z Trying example: test_gqa(int4_kv=True, num_groups=4, B=113, MAX_T=48, N_H_L=96)
2025-05-07T20:30:51.8677496Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=51, MAX_T=61, N_H_L=69)
2025-05-07T20:30:51.8690581Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=113, N_H_L=65)
2025-05-07T20:30:51.8692869Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=65, N_H_L=65)
2025-05-07T20:30:51.8695115Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=65, MAX_T=65, N_H_L=65)
2025-05-07T20:30:51.8697399Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=108, N_H_L=14)
2025-05-07T20:30:51.8699644Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=14, N_H_L=14)
2025-05-07T20:30:51.8701986Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=14)
2025-05-07T20:30:51.8704349Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=6)
2025-05-07T20:30:51.8706841Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=70, MAX_T=94, N_H_L=78)
2025-05-07T20:30:51.8708687Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=78, MAX_T=94, N_H_L=78)
2025-05-07T20:30:51.8710508Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=78)
2025-05-07T20:30:51.8712338Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=94)
2025-05-07T20:30:51.8714265Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=41, MAX_T=105, N_H_L=126)
2025-05-07T20:30:51.8716148Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=126)
2025-05-07T20:30:51.8717983Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=105)
2025-05-07T20:30:51.8719812Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=95, MAX_T=114, N_H_L=43)
2025-05-07T20:30:51.8721603Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=114, N_H_L=43)
2025-05-07T20:30:51.8723490Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=43, N_H_L=43)
2025-05-07T20:30:51.8725297Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=21, MAX_T=38, N_H_L=42)
2025-05-07T20:30:51.8727225Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=38, N_H_L=42)
2025-05-07T20:30:51.8729045Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=42, N_H_L=42)
2025-05-07T20:30:51.8730830Z Trying example: test_gqa(int4_kv=False, num_groups=1, B=42, MAX_T=42, N_H_L=42)
2025-05-07T20:30:51.8732728Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=74, MAX_T=20, N_H_L=15)
2025-05-07T20:30:51.8734524Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=20, N_H_L=15)
2025-05-07T20:30:51.8736323Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=15, N_H_L=15)
2025-05-07T20:30:51.8738150Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=20, N_H_L=15)
2025-05-07T20:30:51.8739950Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=15, N_H_L=15)
2025-05-07T20:30:51.8742286Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=104, N_H_L=69)
2025-05-07T20:30:51.8744277Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=117, N_H_L=69)
2025-05-07T20:30:51.8746071Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=69, MAX_T=117, N_H_L=69)
2025-05-07T20:30:51.8747954Z Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=69, N_H_L=69)
2025-05-07T20:30:51.8749761Z PASSED
2025-05-07T20:30:51.9108230Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
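The runs of "Trying example: ..." lines above are Hypothesis's verbose-mode output: the session banner shows profile 'ci' with derandomize=True, so the same example sequence is replayed on every run. What follows is a minimal sketch of the kind of property-based test that produces this output; the strategies, bounds, and test body are illustrative assumptions, not the actual attention/gqa_test.py source.

import unittest

import hypothesis.strategies as st
from hypothesis import Verbosity, given, settings


class GQATestSketch(unittest.TestCase):
    # verbosity=Verbosity.verbose prints each "Trying example: ..." line;
    # derandomize=True (set here, or via the registered 'ci' profile) makes
    # the example sequence reproducible across CI runs.
    @given(
        int4_kv=st.booleans(),
        num_groups=st.sampled_from([1, 4]),
        B=st.integers(min_value=1, max_value=128),
        MAX_T=st.integers(min_value=4, max_value=128),
        N_H_L=st.integers(min_value=1, max_value=128),
    )
    @settings(verbosity=Verbosity.verbose, derandomize=True, deadline=None)
    def test_gqa_sketch(self, int4_kv: bool, num_groups: int, B: int, MAX_T: int, N_H_L: int) -> None:
        # A real test would build KV caches with these shapes and compare the
        # GQA kernel against a reference; a placeholder check stands in here.
        self.assertGreater(B * MAX_T * N_H_L, 0)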
2025-05-07T20:30:51.9108738Z =========================== short test summary info ============================
2025-05-07T20:30:51.9109468Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when CUDA is not available or xformers is not available
2025-05-07T20:30:51.9110166Z ======================== 1 passed, 1 skipped in 38.16s =========================
2025-05-07T20:30:52.5484926Z [TEST] Python test suite PASSED: ./attention/gqa_test.py
2025-05-07T20:30:52.5505251Z [TEST] Python test time for ./attention/gqa_test.py: 41 seconds
2025-05-07T20:30:52.5527252Z ################################################################################
2025-05-07T20:30:52.5547188Z # [2025-05-07T20:30:52.554Z] Run Python Test Suite:
2025-05-07T20:30:52.5547690Z # ./coalesce/coalesce_test.py
2025-05-07T20:30:52.5548091Z ################################################################################
2025-05-07T20:30:52.5571723Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py
2025-05-07T20:30:54.7048222Z ============================= test session starts ==============================
2025-05-07T20:30:54.7049087Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:30:54.7049659Z cachedir: .pytest_cache
2025-05-07T20:30:54.7050259Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:30:54.7051007Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:30:54.7051429Z plugins: hypothesis-6.131.14
2025-05-07T20:30:56.2512265Z collecting ...
collected 1 item
2025-05-07T20:30:56.9830792Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED
2025-05-07T20:30:56.9831305Z ============================== 1 passed in 2.41s ===============================
2025-05-07T20:30:57.6034934Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py
2025-05-07T20:30:57.6056729Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds
2025-05-07T20:30:57.6079069Z ################################################################################
2025-05-07T20:30:57.6094925Z # [2025-05-07T20:30:57.609Z] Run Python Test Suite:
2025-05-07T20:30:57.6095388Z # ./comm/multi_gpu_car_test.py
2025-05-07T20:30:57.6095706Z ################################################################################
2025-05-07T20:30:57.6119398Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py
2025-05-07T20:30:59.7660435Z ============================= test session starts ==============================
2025-05-07T20:30:59.7661461Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:30:59.7662010Z cachedir: .pytest_cache
2025-05-07T20:30:59.7662665Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:30:59.7663413Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:30:59.7663850Z plugins: hypothesis-6.131.14
2025-05-07T20:31:01.3625014Z collecting ...
collected 5 items
2025-05-07T20:31:01.3636120Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED
2025-05-07T20:31:01.3645040Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED
2025-05-07T20:31:01.3652770Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED
2025-05-07T20:31:01.3660515Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED
2025-05-07T20:31:01.3677752Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED
2025-05-07T20:31:01.3678753Z =========================== short test summary info ============================
2025-05-07T20:31:01.3679786Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:01.3680723Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:01.3681652Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:01.3682572Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:01.3683499Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:01.3684149Z ============================== 5 skipped in 1.73s ==============================
2025-05-07T20:31:01.9023736Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py
2025-05-07T20:31:01.9045754Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds
2025-05-07T20:31:01.9068253Z ################################################################################
2025-05-07T20:31:01.9083736Z # [2025-05-07T20:31:01.908Z] Run Python Test Suite:
2025-05-07T20:31:01.9084230Z # ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:01.9084679Z ################################################################################
2025-05-07T20:31:01.9108877Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:04.0606238Z ============================= test session starts ==============================
2025-05-07T20:31:04.0607343Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:04.0607873Z cachedir: .pytest_cache
2025-05-07T20:31:04.0608467Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:04.0609204Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:04.0609618Z plugins: hypothesis-6.131.14
2025-05-07T20:31:05.7171123Z collecting ...
collected 2 items
2025-05-07T20:31:05.7183074Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED
2025-05-07T20:31:05.7198043Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED
2025-05-07T20:31:05.7198650Z =========================== short test summary info ============================
2025-05-07T20:31:05.7199300Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:31:05.7200148Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:31:05.7200782Z ============================== 2 skipped in 1.79s ==============================
2025-05-07T20:31:06.2551551Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:06.2572323Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds
2025-05-07T20:31:06.2595184Z ################################################################################
2025-05-07T20:31:06.2610276Z # [2025-05-07T20:31:06.260Z] Run Python Test Suite:
2025-05-07T20:31:06.2611127Z # ./kv_cache/kv_cache_test.py
2025-05-07T20:31:06.2611455Z ################################################################################
2025-05-07T20:31:06.2634944Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py
2025-05-07T20:31:08.4232032Z ============================= test session starts ==============================
2025-05-07T20:31:08.4232709Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:08.4233254Z cachedir: .pytest_cache
2025-05-07T20:31:08.4233862Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:08.4234615Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:08.4235036Z plugins: hypothesis-6.131.14
2025-05-07T20:31:09.9965474Z collecting ... collected 4 items
2025-05-07T20:31:13.0394051Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...)
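The SKIPPED results in this suite and the surrounding ones come from hardware gates rather than failures: a linux.g5.4xlarge runner carries a single NVIDIA A10G (compute capability 8.6), so Hopper-only kernels (compute capability 9.0) and H100/MI300-only paths are skipped at collection time. Below is a minimal sketch of such a gate, assuming a hypothetical helper name; FBGEMM's actual guards live in the test sources (e.g. gather_scatter_test.py:29).

import unittest

import torch


def running_on_hopper() -> bool:
    # An H100 reports compute capability (9, 0); the A10G on this runner
    # reports (8, 6), so the guarded tests are skipped here.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)


@unittest.skipIf(
    not running_on_hopper(),
    "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
)
class HopperOnlyTestSketch(unittest.TestCase):
    def test_gather_along_first_dim(self) -> None:
        ...  # body elided; the decorator is the point of this sketch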
2025-05-07T20:31:13.0558084Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED
2025-05-07T20:31:13.0750857Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED
2025-05-07T20:31:13.0911582Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED
2025-05-07T20:31:13.0912317Z =========================== short test summary info ============================
2025-05-07T20:31:13.0913150Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when H100 is not available or MI300 is not available
2025-05-07T20:31:13.0914081Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when xformers is not available
2025-05-07T20:31:13.0914997Z ============================== 4 skipped in 4.80s ==============================
2025-05-07T20:31:14.6874458Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py
2025-05-07T20:31:14.6895330Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds
2025-05-07T20:31:14.6917946Z ################################################################################
2025-05-07T20:31:14.6933876Z # [2025-05-07T20:31:14.693Z] Run Python Test Suite:
2025-05-07T20:31:14.6934352Z # ./moe/activation_test.py
2025-05-07T20:31:14.6934756Z ################################################################################
2025-05-07T20:31:14.6959508Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py
2025-05-07T20:31:16.8514162Z ============================= test session starts ==============================
2025-05-07T20:31:16.8515100Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:16.8515826Z cachedir: .pytest_cache
2025-05-07T20:31:16.8516423Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:16.8517164Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:16.8517595Z plugins: hypothesis-6.131.14
2025-05-07T20:31:18.5053563Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:31:18.7195883Z collecting ...
collected 2 items
2025-05-07T20:31:24.7283496Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=True)
2025-05-07T20:31:24.7286119Z Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=True)
2025-05-07T20:31:24.7288711Z Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=False)
2025-05-07T20:31:24.7290625Z Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=True)
2025-05-07T20:31:24.7292534Z Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=True)
2025-05-07T20:31:24.7294398Z Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=True)
2025-05-07T20:31:24.7296535Z Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=False)
2025-05-07T20:31:24.7298415Z Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:24.7300299Z Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=True)
2025-05-07T20:31:24.7302343Z Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=True)
2025-05-07T20:31:24.7304225Z Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:24.7306108Z Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=True)
2025-05-07T20:31:24.7308090Z Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=True)
2025-05-07T20:31:24.7309969Z Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=True)
2025-05-07T20:31:24.7311876Z Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=False)
2025-05-07T20:31:24.7313782Z Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:24.7315645Z Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=False)
2025-05-07T20:31:24.7317649Z Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=False)
2025-05-07T20:31:24.7319528Z Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=False)
2025-05-07T20:31:24.7321389Z Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=False)
2025-05-07T20:31:24.7323273Z Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=True)
2025-05-07T20:31:24.7325124Z Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=True)
2025-05-07T20:31:24.7326974Z Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:24.7329061Z Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=True)
2025-05-07T20:31:24.7330923Z Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=False)
2025-05-07T20:31:24.7332788Z Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=False)
2025-05-07T20:31:24.7334670Z Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=False)
2025-05-07T20:31:24.7336628Z Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=True)
2025-05-07T20:31:24.7338591Z Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=False)
2025-05-07T20:31:24.7340769Z Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=True)
2025-05-07T20:31:24.7342743Z Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=True)
2025-05-07T20:31:24.7344624Z Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=True)
2025-05-07T20:31:24.7346508Z Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=False)
2025-05-07T20:31:24.7348384Z Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=False)
2025-05-07T20:31:24.7350425Z Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=True)
2025-05-07T20:31:24.7352293Z Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=True)
2025-05-07T20:31:24.7354175Z Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=True)
2025-05-07T20:31:24.7356054Z Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=False)
2025-05-07T20:31:24.7357933Z Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=False)
2025-05-07T20:31:24.7359980Z Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:24.7361840Z PASSED
2025-05-07T20:31:24.7988871Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:24.7989986Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last):
2025-05-07T20:31:24.7991385Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:24.7992885Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:24.7994297Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:24.7995713Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:24.7997387Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:24.7998807Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:24.8000253Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:24.8001529Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     generator.visit(fn.parse())
2025-05-07T20:31:24.8002777Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:24.8004011Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ret = super().visit(node)
2025-05-07T20:31:24.8005075Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
2025-05-07T20:31:24.8006124Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return visitor(node)
2025-05-07T20:31:24.8007424Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:24.8008731Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:24.8009872Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
2025-05-07T20:31:24.8011081Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     self.visit(item)
2025-05-07T20:31:24.8012282Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:24.8013665Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:24.8014752Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:24.8015699Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:24.8016472Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^
2025-05-07T20:31:24.8017522Z W0507 20:31:24.797449 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:24.8168067Z [identical identify_mutated_tensors warning and CompilationError traceback repeated three more times, at 20:31:24.816194, 20:31:24.859117, and 20:31:24.863476]
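These warnings point at the real problem in this suite: Triton's fp8e4nv type is the encoding behind torch.float8_e4m3fn, and compiling kernels that use it generally requires compute capability 8.9 or newer (Ada or Hopper class). The A10G on this runner is compute capability 8.6, so Triton can only offer fp8e4b15 and fp8e5, exactly as the ValueError reports, and the fp8 rowwise-quantization path below fails instead of being skipped. A minimal sketch of a preflight check that could gate such a path follows; the function name is an assumption, not an FBGEMM API.

import torch


def fp8_e4m3_supported() -> bool:
    # fp8e4nv kernels compile on compute capability (8, 9) and newer;
    # older parts only get fp8e4b15/fp8e5, which is what the error above reports.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


if not fp8_e4m3_supported():
    print("fp8e4nv unsupported on this GPU; skipping fp8 rowwise quantization path")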
2025-05-07T20:31:25.3759492Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:25.3762765Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:25.3763121Z     @given(
2025-05-07T20:31:25.3763389Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:25.3763735Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:25.3764052Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:25.3764403Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:25.3764754Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:25.3765054Z     )
2025-05-07T20:31:25.3765416Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:25.3765871Z     def test_silu_mul_quant(
2025-05-07T20:31:25.3766128Z         self,
2025-05-07T20:31:25.3766336Z         T: int,
2025-05-07T20:31:25.3766573Z         D: int,
2025-05-07T20:31:25.3766832Z         scale_ub: Optional[float],
2025-05-07T20:31:25.3767115Z         contiguous: bool,
2025-05-07T20:31:25.3767376Z         compiled: bool,
2025-05-07T20:31:25.3767617Z     ) -> None:
2025-05-07T20:31:25.3767843Z         torch.manual_seed(2025)
2025-05-07T20:31:25.3768679Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:25.3769256Z         x_sign = torch.sign(x)
2025-05-07T20:31:25.3769565Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:25.3769886Z         x = x_sign * x_clamp
2025-05-07T20:31:25.3770139Z         x0 = x[:, :D]
2025-05-07T20:31:25.3770368Z         x1 = x[:, D:]
2025-05-07T20:31:25.3770781Z         if contiguous:
2025-05-07T20:31:25.3771029Z             x0 = x0.contiguous()
2025-05-07T20:31:25.3759492Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:31:25.3760389Z     self=,
2025-05-07T20:31:25.3760822Z     T=1,
2025-05-07T20:31:25.3761016Z     D=5120,
2025-05-07T20:31:25.3761229Z     scale_ub=None,
2025-05-07T20:31:25.3761460Z     contiguous=True,
2025-05-07T20:31:25.3761702Z     compiled=True,
2025-05-07T20:31:25.3761918Z )
2025-05-07T20:31:25.3762258Z self = 
2025-05-07T20:31:25.3762765Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:25.3763036Z 
2025-05-07T20:31:25.3763121Z     @given(
2025-05-07T20:31:25.3763389Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:25.3763735Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:25.3764052Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:25.3764403Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:25.3764754Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:25.3765054Z     )
2025-05-07T20:31:25.3765416Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:25.3765871Z     def test_silu_mul_quant(
2025-05-07T20:31:25.3766128Z         self,
2025-05-07T20:31:25.3766336Z         T: int,
2025-05-07T20:31:25.3766573Z         D: int,
2025-05-07T20:31:25.3766832Z         scale_ub: Optional[float],
2025-05-07T20:31:25.3767115Z         contiguous: bool,
2025-05-07T20:31:25.3767376Z         compiled: bool,
2025-05-07T20:31:25.3767617Z     ) -> None:
2025-05-07T20:31:25.3767843Z         torch.manual_seed(2025)
2025-05-07T20:31:25.3768101Z 
2025-05-07T20:31:25.3768679Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:25.3769048Z 
2025-05-07T20:31:25.3769256Z         x_sign = torch.sign(x)
2025-05-07T20:31:25.3769565Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:25.3769886Z         x = x_sign * x_clamp
2025-05-07T20:31:25.3770139Z         x0 = x[:, :D]
2025-05-07T20:31:25.3770368Z         x1 = x[:, D:]
2025-05-07T20:31:25.3770584Z 
2025-05-07T20:31:25.3770781Z         if contiguous:
2025-05-07T20:31:25.3771029Z             x0 = x0.contiguous()
2025-05-07T20:31:25.3771303Z             x1 = x1.contiguous()
2025-05-07T20:31:25.3771551Z 
2025-05-07T20:31:25.3771759Z         if scale_ub is not None:
2025-05-07T20:31:25.3772045Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:25.3772393Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:25.3772719Z             )
2025-05-07T20:31:25.3772927Z         else:
2025-05-07T20:31:25.3773143Z             scale_ub_tensor = None
2025-05-07T20:31:25.3773420Z 
2025-05-07T20:31:25.3773668Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:25.3773993Z             op = silu_mul_quant
2025-05-07T20:31:25.3774263Z             if compiled:
2025-05-07T20:31:25.3774529Z                 op = torch.compile(op)
2025-05-07T20:31:25.3774834Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:25.3775128Z 
2025-05-07T20:31:25.3775337Z         y_fp8, y_scale = fn()
2025-05-07T20:31:25.3775633Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:25.3775942Z 
2025-05-07T20:31:25.3776193Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:25.3776540Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:25.3776841Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:25.3777173Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:25.3777549Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:25.3778889Z 
2025-05-07T20:31:25.3779112Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:25.3779313Z 
2025-05-07T20:31:25.3779428Z moe/activation_test.py:126: 
2025-05-07T20:31:25.3779733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:25.3780084Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:25.3780429Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:25.3781331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:25.3782103Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:25.3782677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:25.3783378Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:25.3784091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:25.3784841Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:25.3785612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:25.3786366Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:25.3787093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:25.3787742Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:25.3788363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:25.3788895Z     fn()
2025-05-07T20:31:25.3789401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:25.3790082Z     self.fn.run(
2025-05-07T20:31:25.3790569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:25.3791106Z     kernel = self.compile(
2025-05-07T20:31:25.3791666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:25.3792333Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:25.3792746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:25.3792980Z 
2025-05-07T20:31:25.3793195Z self = 
2025-05-07T20:31:25.3794297Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:25.3795698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317a5ba040>}
2025-05-07T20:31:25.3797056Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:25.3798093Z context = 
2025-05-07T20:31:25.3798401Z 
2025-05-07T20:31:25.3798578Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:25.3799111Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:25.3799578Z                            module_map=module_map)
2025-05-07T20:31:25.3799963Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:25.3800335Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:25.3800611Z E       ^
2025-05-07T20:31:25.3801182Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:25.3801641Z 
2025-05-07T20:31:25.3802070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
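The eager reference path fails inside FBGEMM's rowwise fp8 quantization rather than in the test logic itself. A standalone reproduction sketch, assuming only a CUDA build of fbgemm_gpu (the import path and the None scale_ub argument are taken from the traceback and test source above; the tensor shape is arbitrary):

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    # On a GPU without fp8e4nv support (SM < 8.9) this raises the same
    # triton.compiler.errors.CompilationError, since the kernel's fp8 output
    # type maps to Triton's fp8e4nv per the error above.
    y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)

The error is therefore independent of the hypothesis parameters; T, D, scale_ub, contiguous, and compiled only change where it is first triggered.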
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:25.9703553Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:25.9704973Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:25.9706225Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:25.9707457Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:25.9708665Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:25.9709704Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:25.9710723Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:25.9711950Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:25.9713412Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:25.9714530Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:25.9715574Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:25.9716792Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:25.9718163Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:25.9719236Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:25.9720150Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:25.9720899Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:25.9721931Z W0507 20:31:25.964929 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.1779947Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:26.1781816Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:26.1783186Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:26.1784612Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:26.1785998Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:26.1787403Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.1788726Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:26.1790102Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.1791513Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:26.1792759Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:26.1793996Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:26.1795398Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:26.1796441Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:26.1797464Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:26.1798698Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:26.1800000Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:26.1801125Z W0507 
20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:26.1802179Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:26.1803361Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:26.1804730Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:26.1805869Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.1806801Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:26.1807555Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:26.1808574Z W0507 20:31:26.173967 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.7408884Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:26.7410092Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:26.7411470Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:26.7412990Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:26.7414381Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:26.7415785Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.7417598Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:26.7418987Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.7420423Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:26.7421800Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:26.7423041Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:26.7424267Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:26.7425308Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:26.7426331Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:26.7427569Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:26.7429008Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:26.7430138Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:26.7431182Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:26.7432371Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:26.7433719Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:26.7434804Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.7435725Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:26.7436475Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:26.7437510Z W0507 20:31:26.736781 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.7801737Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:26.7803233Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:26.7804736Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:26.7806352Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:26.7807744Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:26.7809133Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.7810461Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:26.7811850Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.7813281Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:26.7814527Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:26.7815864Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:26.7817084Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:26.7818123Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:26.7819149Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:26.7820385Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:26.7821763Z W0507 20:31:26.776267 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:26.7822891Z W0507 
2025-05-07T20:31:27.5399147Z self = 
2025-05-07T20:31:27.5399919Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:31:27.5400316Z 
2025-05-07T20:31:27.5400430Z     @given(
2025-05-07T20:31:27.5400774Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:27.5401215Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:27.5401562Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:27.5401921Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:27.5402266Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:27.5402566Z     )
2025-05-07T20:31:27.5402921Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:27.5403374Z     def test_silu_mul_quant(
2025-05-07T20:31:27.5403625Z         self,
2025-05-07T20:31:27.5403823Z         T: int,
2025-05-07T20:31:27.5404028Z         D: int,
2025-05-07T20:31:27.5404257Z         scale_ub: Optional[float],
2025-05-07T20:31:27.5404531Z         contiguous: bool,
2025-05-07T20:31:27.5404780Z         compiled: bool,
2025-05-07T20:31:27.5405026Z     ) -> None:
2025-05-07T20:31:27.5405243Z         torch.manual_seed(2025)
2025-05-07T20:31:27.5405560Z 
2025-05-07T20:31:27.5405957Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:27.5406444Z 
2025-05-07T20:31:27.5406645Z         x_sign = torch.sign(x)
2025-05-07T20:31:27.5407408Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:27.5407924Z         x = x_sign * x_clamp
2025-05-07T20:31:27.5408285Z         x0 = x[:, :D]
2025-05-07T20:31:27.5408602Z         x1 = x[:, D:]
2025-05-07T20:31:27.5408825Z 
2025-05-07T20:31:27.5409021Z         if contiguous:
2025-05-07T20:31:27.5409264Z             x0 = x0.contiguous()
2025-05-07T20:31:27.5409525Z             x1 = x1.contiguous()
2025-05-07T20:31:27.5409775Z 
2025-05-07T20:31:27.5409975Z         if scale_ub is not None:
2025-05-07T20:31:27.5410251Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:27.5410596Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:27.5410916Z             )
2025-05-07T20:31:27.5411114Z         else:
2025-05-07T20:31:27.5411335Z             scale_ub_tensor = None
2025-05-07T20:31:27.5411601Z 
2025-05-07T20:31:27.5411847Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:27.5412184Z             op = silu_mul_quant
2025-05-07T20:31:27.5412444Z             if compiled:
2025-05-07T20:31:27.5412705Z                 op = torch.compile(op)
2025-05-07T20:31:27.5413015Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:27.5413291Z 
2025-05-07T20:31:27.5413491Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:27.5413659Z 
2025-05-07T20:31:27.5413770Z moe/activation_test.py:117: 
2025-05-07T20:31:27.5414066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:27.5414408Z moe/activation_test.py:115: in fn
2025-05-07T20:31:27.5414701Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:27.5415399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:27.5416090Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:27.5416645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:27.5417527Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:27.5418192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:27.5418743Z     kernel = self.compile(
2025-05-07T20:31:27.5419289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:27.5419948Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:27.5420347Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:27.5420587Z 
2025-05-07T20:31:27.5420794Z self = 
2025-05-07T20:31:27.5421985Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:27.5423387Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317a60f9d0>}
2025-05-07T20:31:27.5424714Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:27.5425730Z context = 
2025-05-07T20:31:27.5426030Z 
2025-05-07T20:31:27.5426203Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:27.5426731Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:27.5427235Z                            module_map=module_map)
2025-05-07T20:31:27.5427627Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:27.5428076Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:27.5428347Z E       ^
2025-05-07T20:31:27.5428811Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:27.5429266Z 
2025-05-07T20:31:27.5429684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
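Every drawn example fails the same way, so hypothesis keeps cycling through its sample grid until max_examples is exhausted. One way to avoid burning CI time on unsupported GPUs is to gate the test class on compute capability; a sketch, assuming unittest-style tests as the log's class name suggests (the decorator placement and helper are illustrative, not the repository's actual code):

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv requires compute capability 8.9 or newer.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "GPU lacks fp8e4nv support (needs SM 8.9+)")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant as listed above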
2025-05-07T20:31:27.5430194Z 
2025-05-07T20:31:27.5430305Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:27.5430725Z     self=,
2025-05-07T20:31:27.5431126Z     T=2048,
2025-05-07T20:31:27.5431322Z     D=5120,
2025-05-07T20:31:27.5431522Z     scale_ub=1200.0,
2025-05-07T20:31:27.5431749Z     contiguous=True,
2025-05-07T20:31:27.5431978Z     compiled=True,
2025-05-07T20:31:27.5432197Z )
2025-05-07T20:31:27.5432518Z self = 
2025-05-07T20:31:27.5433024Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:27.5433302Z 
2025-05-07T20:31:27.5433389Z     @given(
2025-05-07T20:31:27.5433621Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:27.5433946Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:27.5434262Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:27.5434598Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:27.5434929Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:27.5435223Z     )
2025-05-07T20:31:27.5435585Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:27.5436030Z     def test_silu_mul_quant(
2025-05-07T20:31:27.5436283Z         self,
2025-05-07T20:31:27.5436489Z         T: int,
2025-05-07T20:31:27.5436688Z         D: int,
2025-05-07T20:31:27.5436922Z         scale_ub: Optional[float],
2025-05-07T20:31:27.5437208Z         contiguous: bool,
2025-05-07T20:31:27.5437453Z         compiled: bool,
2025-05-07T20:31:27.5437828Z     ) -> None:
2025-05-07T20:31:27.5438055Z         torch.manual_seed(2025)
2025-05-07T20:31:27.5438306Z 
2025-05-07T20:31:27.5438588Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:27.5438940Z 
2025-05-07T20:31:27.5439139Z         x_sign = torch.sign(x)
2025-05-07T20:31:27.5439442Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:27.5439762Z         x = x_sign * x_clamp
2025-05-07T20:31:27.5440014Z         x0 = x[:, :D]
2025-05-07T20:31:27.5440672Z         x1 = x[:, D:]
2025-05-07T20:31:27.5440972Z 
2025-05-07T20:31:27.5441232Z         if contiguous:
2025-05-07T20:31:27.5441494Z             x0 = x0.contiguous()
2025-05-07T20:31:27.5441761Z             x1 = x1.contiguous()
2025-05-07T20:31:27.5442006Z 
2025-05-07T20:31:27.5442195Z         if scale_ub is not None:
2025-05-07T20:31:27.5442474Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:27.5442821Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:27.5443137Z             )
2025-05-07T20:31:27.5443356Z         else:
2025-05-07T20:31:27.5443574Z             scale_ub_tensor = None
2025-05-07T20:31:27.5443830Z 
2025-05-07T20:31:27.5444066Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:27.5444389Z             op = silu_mul_quant
2025-05-07T20:31:27.5444641Z             if compiled:
2025-05-07T20:31:27.5444895Z                 op = torch.compile(op)
2025-05-07T20:31:27.5445202Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:27.5445478Z 
2025-05-07T20:31:27.5445679Z         y_fp8, y_scale = fn()
2025-05-07T20:31:27.5445971Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:27.5446261Z 
2025-05-07T20:31:27.5446504Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:27.5446847Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:27.5447140Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:27.5456593Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:27.5456991Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:27.5457319Z 
2025-05-07T20:31:27.5457526Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:27.5457741Z 
2025-05-07T20:31:27.5457847Z moe/activation_test.py:126: 
2025-05-07T20:31:27.5458156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:27.5458496Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:27.5458842Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:27.5459651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:27.5460411Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:27.5460961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:27.5461752Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:27.5462459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:27.5463180Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:27.5463938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:27.5464700Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:27.5465432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:27.5466069Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:27.5466686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:27.5467353Z     fn()
2025-05-07T20:31:27.5467862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:27.5468436Z     self.fn.run(
2025-05-07T20:31:27.5468912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:27.5469457Z     kernel = self.compile(
2025-05-07T20:31:27.5469996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:27.5470655Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:27.5471073Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:27.5471309Z 
2025-05-07T20:31:27.5471524Z self = 
2025-05-07T20:31:27.5472600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:27.5473993Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317a693a60>}
2025-05-07T20:31:27.5475348Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:27.5476370Z context = 
2025-05-07T20:31:27.5476658Z 
2025-05-07T20:31:27.5476836Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:27.5477367Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:27.5477852Z                            module_map=module_map)
2025-05-07T20:31:27.5478313Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:27.5478678Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:27.5478952Z E       ^
2025-05-07T20:31:27.5479424Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:27.5479872Z 
2025-05-07T20:31:27.5480298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
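Note the pattern across examples: with compiled=False the failure surfaces immediately in fn() inside silu_mul_quant, while with compiled=True it surfaces later, in ref_fn()'s direct Triton call; either way the root cause is the same ValueError. The reference math before quantization is ordinary PyTorch (SiLU times a gate) and runs on any device; only the fp8 rowwise quantization needs the newer architecture. Extracted as a standalone sketch mirroring ref_fn above (the helper name is hypothetical):

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # x * sigmoid(x) is SiLU, so this computes silu(x0) * x1 in fp32,
        # exactly the pre-quantization value that ref_fn feeds to
        # triton_quantize_fp8_row.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32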
2025-05-07T20:31:27.5480816Z 
2025-05-07T20:31:27.5480922Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:27.5481341Z     self=,
2025-05-07T20:31:27.5481747Z     T=16384,
2025-05-07T20:31:27.5481941Z     D=7168,
2025-05-07T20:31:27.5482141Z     scale_ub=1200.0,
2025-05-07T20:31:27.5482373Z     contiguous=False,
2025-05-07T20:31:27.5482601Z     compiled=False,
2025-05-07T20:31:27.5482817Z )
2025-05-07T20:31:27.9704478Z W0507 20:31:27.966295 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:28.1320504Z W0507 20:31:28.128102 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:28.6265297Z W0507 20:31:28.622463 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:28.6661927Z W0507 20:31:28.662323 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:30.1524381Z self = 
2025-05-07T20:31:30.1524950Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:31:30.1525246Z 
2025-05-07T20:31:30.1525333Z     @given(
2025-05-07T20:31:30.1525625Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:30.1525955Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:30.1526269Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:30.1526617Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:30.1526963Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:30.1527258Z     )
2025-05-07T20:31:30.1527621Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:30.1528075Z     def test_silu_mul_quant(
2025-05-07T20:31:30.1528333Z         self,
2025-05-07T20:31:30.1528535Z         T: int,
2025-05-07T20:31:30.1528743Z         D: int,
2025-05-07T20:31:30.1528977Z         scale_ub: Optional[float],
2025-05-07T20:31:30.1529253Z         contiguous: bool,
2025-05-07T20:31:30.1529504Z         compiled: bool,
2025-05-07T20:31:30.1529745Z     ) -> None:
2025-05-07T20:31:30.1529968Z         torch.manual_seed(2025)
2025-05-07T20:31:30.1530224Z 
2025-05-07T20:31:30.1530965Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:30.1531328Z 
2025-05-07T20:31:30.1531535Z         x_sign = torch.sign(x)
2025-05-07T20:31:30.1531836Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:30.1532161Z         x = x_sign * x_clamp
2025-05-07T20:31:30.1532418Z         x0 = x[:, :D]
2025-05-07T20:31:30.1532641Z         x1 = x[:, D:]
2025-05-07T20:31:30.1532866Z 
2025-05-07T20:31:30.1533069Z         if contiguous:
2025-05-07T20:31:30.1533311Z             x0 = x0.contiguous()
2025-05-07T20:31:30.1533586Z             x1 = x1.contiguous()
2025-05-07T20:31:30.1533842Z 
2025-05-07T20:31:30.1534040Z         if scale_ub is not None:
2025-05-07T20:31:30.1534328Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:30.1534683Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:30.1535000Z             )
2025-05-07T20:31:30.1535212Z         else:
2025-05-07T20:31:30.1535457Z             scale_ub_tensor = None
2025-05-07T20:31:30.1524381Z self = 
2025-05-07T20:31:30.1524950Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:31:30.1525246Z 
2025-05-07T20:31:30.1525333Z     @given(
2025-05-07T20:31:30.1525625Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:30.1525955Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:30.1526269Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:30.1526617Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:30.1526963Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:30.1527258Z     )
2025-05-07T20:31:30.1527621Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:30.1528075Z     def test_silu_mul_quant(
2025-05-07T20:31:30.1528333Z         self,
2025-05-07T20:31:30.1528535Z         T: int,
2025-05-07T20:31:30.1528743Z         D: int,
2025-05-07T20:31:30.1528977Z         scale_ub: Optional[float],
2025-05-07T20:31:30.1529253Z         contiguous: bool,
2025-05-07T20:31:30.1529504Z         compiled: bool,
2025-05-07T20:31:30.1529745Z     ) -> None:
2025-05-07T20:31:30.1529968Z         torch.manual_seed(2025)
2025-05-07T20:31:30.1530224Z 
2025-05-07T20:31:30.1530965Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:30.1531328Z 
2025-05-07T20:31:30.1531535Z         x_sign = torch.sign(x)
2025-05-07T20:31:30.1531836Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:30.1532161Z         x = x_sign * x_clamp
2025-05-07T20:31:30.1532418Z         x0 = x[:, :D]
2025-05-07T20:31:30.1532641Z         x1 = x[:, D:]
2025-05-07T20:31:30.1532866Z 
2025-05-07T20:31:30.1533069Z         if contiguous:
2025-05-07T20:31:30.1533311Z             x0 = x0.contiguous()
2025-05-07T20:31:30.1533586Z             x1 = x1.contiguous()
2025-05-07T20:31:30.1533842Z 
2025-05-07T20:31:30.1534040Z         if scale_ub is not None:
2025-05-07T20:31:30.1534328Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:30.1534683Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:30.1535000Z             )
2025-05-07T20:31:30.1535212Z         else:
2025-05-07T20:31:30.1535457Z             scale_ub_tensor = None
2025-05-07T20:31:30.1535726Z 
2025-05-07T20:31:30.1535970Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:30.1536304Z             op = silu_mul_quant
2025-05-07T20:31:30.1536573Z             if compiled:
2025-05-07T20:31:30.1536832Z                 op = torch.compile(op)
2025-05-07T20:31:30.1537142Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:30.1537432Z 
2025-05-07T20:31:30.1537635Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:30.1537833Z 
2025-05-07T20:31:30.1537951Z moe/activation_test.py:117: 
2025-05-07T20:31:30.1538283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:30.1538621Z moe/activation_test.py:115: in fn
2025-05-07T20:31:30.1538917Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:30.1539622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:30.1540743Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:30.1541372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:30.1542069Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:30.1542739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:30.1543274Z     kernel = self.compile(
2025-05-07T20:31:30.1543829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:30.1544494Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:30.1544902Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:30.1545136Z 
2025-05-07T20:31:30.1545347Z self = 
2025-05-07T20:31:30.1546443Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:30.1547852Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317ad33700>}
2025-05-07T20:31:30.1549198Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:30.1550222Z context = 
2025-05-07T20:31:30.1550514Z 
2025-05-07T20:31:30.1550685Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:30.1551219Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:30.1551828Z                            module_map=module_map)
2025-05-07T20:31:30.1552203Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:30.1552575Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:30.1552844Z E       ^
2025-05-07T20:31:30.1553318Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:30.1553779Z 
2025-05-07T20:31:30.1554199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:30.1554733Z 
2025-05-07T20:31:30.1554841Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:30.1555265Z     self=,
2025-05-07T20:31:30.1555680Z     T=1,
2025-05-07T20:31:30.1555871Z     D=7168,
2025-05-07T20:31:30.1556076Z     scale_ub=None,
2025-05-07T20:31:30.1556301Z     contiguous=True,
2025-05-07T20:31:30.1556529Z     compiled=True,
2025-05-07T20:31:30.1556762Z )
2025-05-07T20:31:30.1557096Z self = 
2025-05-07T20:31:30.1557584Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:30.1557854Z 
2025-05-07T20:31:30.1557934Z     @given(
2025-05-07T20:31:30.1558178Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:30.1558494Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:30.1558813Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:30.1559158Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:30.1559494Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:30.1559799Z     )
2025-05-07T20:31:30.1560159Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:30.1560616Z     def test_silu_mul_quant(
2025-05-07T20:31:30.1560862Z         self,
2025-05-07T20:31:30.1561071Z         T: int,
2025-05-07T20:31:30.1561286Z         D: int,
2025-05-07T20:31:30.1561591Z         scale_ub: Optional[float],
2025-05-07T20:31:30.1561878Z         contiguous: bool,
2025-05-07T20:31:30.1562127Z         compiled: bool,
2025-05-07T20:31:30.1562355Z     ) -> None:
2025-05-07T20:31:30.1562584Z         torch.manual_seed(2025)
2025-05-07T20:31:30.1562839Z 
2025-05-07T20:31:30.1563114Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:30.1563472Z 
2025-05-07T20:31:30.1563681Z         x_sign = torch.sign(x)
2025-05-07T20:31:30.1563976Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:30.1564301Z         x = x_sign * x_clamp
2025-05-07T20:31:30.1564555Z         x0 = x[:, :D]
2025-05-07T20:31:30.1564778Z         x1 = x[:, D:]
2025-05-07T20:31:30.1564996Z 
2025-05-07T20:31:30.1565193Z         if contiguous:
2025-05-07T20:31:30.1565434Z             x0 = x0.contiguous()
2025-05-07T20:31:30.1565696Z             x1 = x1.contiguous()
2025-05-07T20:31:30.1565949Z 
2025-05-07T20:31:30.1566161Z         if scale_ub is not None:
2025-05-07T20:31:30.1566441Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:30.1566787Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:30.1567106Z             )
2025-05-07T20:31:30.1567302Z         else:
2025-05-07T20:31:30.1567525Z             scale_ub_tensor = None
2025-05-07T20:31:30.1567784Z 
2025-05-07T20:31:30.1568017Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:30.1568348Z             op = silu_mul_quant
2025-05-07T20:31:30.1568654Z             if compiled:
2025-05-07T20:31:30.1568912Z                 op = torch.compile(op)
2025-05-07T20:31:30.1569224Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:30.1569512Z 
2025-05-07T20:31:30.1569707Z         y_fp8, y_scale = fn()
2025-05-07T20:31:30.1570008Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:30.1570313Z 
2025-05-07T20:31:30.1570563Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:30.1570992Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:30.1571301Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:30.1571627Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:30.1571994Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:30.1572313Z 
2025-05-07T20:31:30.1572528Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:30.1572727Z 
2025-05-07T20:31:30.1572837Z moe/activation_test.py:126: 
2025-05-07T20:31:30.1573138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:30.1573490Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:30.1573830Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:30.1574619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:30.1575391Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:30.1575965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:30.1576664Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:30.1577355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:30.1578086Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:30.1578850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:30.1579607Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:30.1580336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:30.1581061Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:30.1581757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:30.1582281Z     fn()
2025-05-07T20:31:30.1582803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:30.1583391Z     self.fn.run(
2025-05-07T20:31:30.1583872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:30.1584402Z     kernel = self.compile(
2025-05-07T20:31:30.1584960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:30.1585619Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:30.1586023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:30.1586262Z 
2025-05-07T20:31:30.1586482Z self = 
2025-05-07T20:31:30.1587576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:30.1588949Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317b531ee0>}
2025-05-07T20:31:30.1590299Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:30.1591322Z context = 
2025-05-07T20:31:30.1591621Z 
2025-05-07T20:31:30.1591794Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:30.1592336Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:30.1592902Z                            module_map=module_map)
2025-05-07T20:31:30.1593272Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:30.1593640Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:30.1593917Z E       ^
2025-05-07T20:31:30.1594382Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:30.1594854Z 
2025-05-07T20:31:30.1595273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
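The failure above is notable because it is the test's own reference path, not the fused kernel: ref_fn computes SiLU(x0) * x1 in fp32 and hands the result to triton_quantize_fp8_row, whose _kernel_quantize_fp8_row trips over the very same fp8e4nv cast. For readers without the FBGEMM source at hand, here is a rough eager-mode sketch of what that rowwise quantization plausibly computes. Per-row max-abs scaling with an optional scale_ub clamp is an assumption consistent with the test's dequant check y_fp8.to(torch.float32) * y_scale[:, None], not code taken from the kernel, and it needs a PyTorch build with the float8 dtypes:

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Scale each row so its largest |value| lands at the fp8 maximum,
        # optionally clamping the row max to scale_ub first.
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = FP8_E4M3_MAX / torch.clamp(row_max, min=1e-12)
        y_fp8 = (y.float() * scale[:, None]).to(torch.float8_e4m3fn)
        # Return the inverse scale so dequantization is
        # y_fp8.to(torch.float32) * y_scale[:, None], as in the test.
        return y_fp8, scale.reciprocal()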
2025-05-07T20:31:30.1595807Z 
2025-05-07T20:31:30.1595913Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:30.1596333Z     self=,
2025-05-07T20:31:30.1596736Z     T=4096,
2025-05-07T20:31:30.1596933Z     D=5120,
2025-05-07T20:31:30.1597132Z     scale_ub=None,
2025-05-07T20:31:30.1597363Z     contiguous=False,
2025-05-07T20:31:30.1597600Z     compiled=False,
2025-05-07T20:31:30.1597813Z )
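Note how the drawn parameters never matter: every example dies inside make_ir, while the kernel is still being lowered to TTIR and before any tensor shapes come into play. If a standalone reproducer is wanted, something this small should hit the same CompilationError on an SM 8.6 GPU (an untested sketch; the kernel is illustrative, not FBGEMM's):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _store_to_fp8(y_ptr, BLOCK: tl.constexpr):
        offs = tl.arange(0, BLOCK)
        vals = tl.full((BLOCK,), 0.5, tl.float32)
        # Storing fp32 values through an fp8e4nv pointer forces the cast
        # that the architecture check rejects.
        tl.store(y_ptr + offs, vals)

    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    _store_to_fp8[(1,)](y, BLOCK=16)  # raises triton.compiler.errors.CompilationError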
2025-05-07T20:31:35.7915960Z 
2025-05-07T20:31:35.7916069Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:35.7916494Z     self=,
2025-05-07T20:31:35.7916905Z     T=4096,
2025-05-07T20:31:35.7917107Z     D=7168,
2025-05-07T20:31:35.7917310Z     scale_ub=None,
2025-05-07T20:31:35.7917539Z     contiguous=False,
2025-05-07T20:31:35.7917772Z     compiled=False,
2025-05-07T20:31:35.7917994Z )
2025-05-07T20:31:35.7947780Z 
2025-05-07T20:31:35.7947888Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:35.7948319Z     self=,
2025-05-07T20:31:35.7948737Z     T=128,
2025-05-07T20:31:35.7948928Z     D=7168,
2025-05-07T20:31:35.7949139Z     scale_ub=None,
2025-05-07T20:31:35.7949366Z     contiguous=False,
2025-05-07T20:31:35.7949624Z     compiled=True,
2025-05-07T20:31:35.7949863Z )
2025-05-07T20:31:35.8815998Z 
2025-05-07T20:31:35.8816106Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:35.8816535Z     self=,
2025-05-07T20:31:35.8816949Z     T=128,
2025-05-07T20:31:35.8817139Z     D=7168,
2025-05-07T20:31:35.8817346Z     scale_ub=None,
2025-05-07T20:31:35.8817575Z     contiguous=False,
2025-05-07T20:31:35.8817809Z     compiled=False,
2025-05-07T20:31:35.8818032Z )
2025-05-07T20:31:36.1375745Z 
2025-05-07T20:31:36.1375857Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.1376298Z     self=,
2025-05-07T20:31:36.1376867Z     T=4096,
2025-05-07T20:31:36.1377138Z     D=5120,
2025-05-07T20:31:36.1377417Z     scale_ub=1200.0,
2025-05-07T20:31:36.1377730Z     contiguous=True,
2025-05-07T20:31:36.1378018Z     compiled=False,
2025-05-07T20:31:36.1378243Z )
2025-05-07T20:31:36.1407785Z 
2025-05-07T20:31:36.1407890Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.1408389Z     self=,
2025-05-07T20:31:36.1408803Z     T=1,
2025-05-07T20:31:36.1408986Z     D=5120,
2025-05-07T20:31:36.1409184Z     scale_ub=None,
2025-05-07T20:31:36.1409403Z     contiguous=True,
2025-05-07T20:31:36.1409626Z     compiled=True,
2025-05-07T20:31:36.1409841Z )
2025-05-07T20:31:36.6748815Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:36.6750264Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last):
2025-05-07T20:31:36.6751628Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:36.6753106Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:36.6754512Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:36.6755892Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.6757200Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:36.6758583Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.6760462Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:36.6761721Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     generator.visit(fn.parse())
2025-05-07T20:31:36.6762952Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:36.6764174Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ret = super().visit(node)
2025-05-07T20:31:36.6765232Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
2025-05-07T20:31:36.6766277Z W0507 20:31:36.670769 86812
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:31:36.6767515Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:36.6768812Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:36.6769936Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:36.6771143Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:36.6772349Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:36.6773711Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:36.6774789Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.6775708Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.6776462Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:36.6777510Z W0507 20:31:36.670769 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.7610169Z self = 2025-05-07T20:31:37.7611159Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:37.7622273Z 2025-05-07T20:31:37.7622520Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.7622851Z op = silu_mul_quant 2025-05-07T20:31:37.7623108Z if compiled: 2025-05-07T20:31:37.7623371Z op = torch.compile(op) 2025-05-07T20:31:37.7623680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.7623961Z 2025-05-07T20:31:37.7624165Z y_fp8, y_scale = fn() 2025-05-07T20:31:37.7624464Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:37.7624761Z 2025-05-07T20:31:37.7625011Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.7625367Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:37.7625678Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:37.7625999Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:37.7626370Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:37.7626702Z 2025-05-07T20:31:37.7626919Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:37.7627119Z 2025-05-07T20:31:37.7627228Z moe/activation_test.py:126: 2025-05-07T20:31:37.7627543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7627895Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:37.7628237Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:37.7629036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:37.7630157Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:37.7630751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.7631449Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.7632155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:37.7632891Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:37.7633656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:37.7634406Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:37.7635144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:37.7635801Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:37.7636430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:37.7636952Z fn() 2025-05-07T20:31:37.7637468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:37.7638055Z self.fn.run( 2025-05-07T20:31:37.7638530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.7639069Z kernel = self.compile( 2025-05-07T20:31:37.7639620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.7640691Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.7641102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7641346Z 2025-05-07T20:31:37.7641566Z self = 2025-05-07T20:31:37.7642805Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.7644207Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f313d565280>} 2025-05-07T20:31:37.7645544Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.7646576Z context = 2025-05-07T20:31:37.7646878Z 2025-05-07T20:31:37.7647052Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.7647594Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.7648069Z module_map=module_map) 2025-05-07T20:31:37.7648453Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.7648828Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:37.7649110Z E ^ 2025-05-07T20:31:37.7649578Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.7650038Z 2025-05-07T20:31:37.7650461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.7650980Z 2025-05-07T20:31:37.7651097Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.7651513Z self=, 2025-05-07T20:31:37.7651927Z T=2048, 2025-05-07T20:31:37.7652128Z D=5120, 2025-05-07T20:31:37.7652335Z scale_ub=None, 2025-05-07T20:31:37.7652675Z contiguous=True, 2025-05-07T20:31:37.7652915Z compiled=True, 2025-05-07T20:31:37.7653133Z ) 2025-05-07T20:31:39.5039748Z self = 2025-05-07T20:31:39.5041225Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:39.5051811Z 2025-05-07T20:31:39.5052046Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.5052371Z op = silu_mul_quant 2025-05-07T20:31:39.5052636Z if compiled: 2025-05-07T20:31:39.5052888Z op = torch.compile(op) 2025-05-07T20:31:39.5053196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.5053483Z 2025-05-07T20:31:39.5053687Z y_fp8, y_scale = fn() 2025-05-07T20:31:39.5053974Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:39.5054276Z 2025-05-07T20:31:39.5054521Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.5054860Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:39.5055164Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:39.5055503Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:39.5055867Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:39.5056185Z 2025-05-07T20:31:39.5056396Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:39.5056593Z 2025-05-07T20:31:39.5056699Z moe/activation_test.py:126: 2025-05-07T20:31:39.5057011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.5057359Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:39.5057693Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:39.5058482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:39.5059243Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:39.5059801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.5060663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.5061516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:39.5062250Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:39.5063009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:39.5063754Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:39.5064484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:39.5065135Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:39.5065749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:39.5066279Z fn() 2025-05-07T20:31:39.5066798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:39.5067385Z self.fn.run( 2025-05-07T20:31:39.5067851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.5068387Z kernel = self.compile( 2025-05-07T20:31:39.5068941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.5069600Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.5070001Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.5070245Z 2025-05-07T20:31:39.5070457Z self = 2025-05-07T20:31:39.5071590Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:39.5073065Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f2fd947f0d0>}
2025-05-07T20:31:39.5074407Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:31:39.5075422Z context = <...>
2025-05-07T20:31:39.5075718Z 
2025-05-07T20:31:39.5075890Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:39.5076420Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:39.5076888Z                            module_map=module_map)
2025-05-07T20:31:39.5077496Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:39.5077860Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:39.5078134Z E       ^
2025-05-07T20:31:39.5078602Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:39.5079054Z 
2025-05-07T20:31:39.5079471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:39.5079982Z 
2025-05-07T20:31:39.5080092Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:39.5080513Z     self=<...>,
2025-05-07T20:31:39.5080915Z     T=128,
2025-05-07T20:31:39.5081108Z     D=5120,
2025-05-07T20:31:39.5081310Z     scale_ub=None,
2025-05-07T20:31:39.5081527Z     contiguous=True,
2025-05-07T20:31:39.5081762Z     compiled=True,
2025-05-07T20:31:39.5081981Z )
2025-05-07T20:31:40.0378415Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:40.0379532Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last):
2025-05-07T20:31:40.0380880Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:40.0382436Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:40.0383850Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:40.0385250Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:40.0386568Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:40.0387973Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:40.0389407Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:40.0390825Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     generator.visit(fn.parse())
2025-05-07T20:31:40.0392054Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:40.0393270Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ret = super().visit(node)
2025-05-07T20:31:40.0394326Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
2025-05-07T20:31:40.0395363Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     return visitor(node)
2025-05-07T20:31:40.0396614Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:40.0397906Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:40.0399021Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
2025-05-07T20:31:40.0400072Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     self.visit(item)
2025-05-07T20:31:40.0401265Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:40.0402799Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:40.0403870Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:40.0404788Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:40.0405537Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^
2025-05-07T20:31:40.0406572Z W0507 20:31:40.033683 86812 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[The identify_mutated_tensors warning and traceback above repeat verbatim three more times (20:31:40.224560, 20:31:40.737247, 20:31:40.777088).]
2025-05-07T20:31:41.2379497Z self = <...>
2025-05-07T20:31:41.2380241Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:41.2380626Z 
2025-05-07T20:31:41.2380747Z     @given(
2025-05-07T20:31:41.2381609Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:41.2382041Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:41.2382389Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:41.2382737Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:41.2383077Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:41.2383380Z     )
2025-05-07T20:31:41.2383747Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:41.2384215Z     def test_silu_mul_quant(
2025-05-07T20:31:41.2384466Z         self,
2025-05-07T20:31:41.2384673Z         T: int,
2025-05-07T20:31:41.2384884Z         D: int,
2025-05-07T20:31:41.2385108Z         scale_ub: Optional[float],
2025-05-07T20:31:41.2385393Z         contiguous: bool,
2025-05-07T20:31:41.2385646Z         compiled: bool,
2025-05-07T20:31:41.2385880Z     ) -> None:
2025-05-07T20:31:41.2386113Z         torch.manual_seed(2025)
2025-05-07T20:31:41.2386375Z 
2025-05-07T20:31:41.2386661Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:41.2387020Z 
2025-05-07T20:31:41.2387227Z         x_sign = torch.sign(x)
2025-05-07T20:31:41.2387551Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:41.2387870Z         x = x_sign * x_clamp
2025-05-07T20:31:41.2388123Z         x0 = x[:, :D]
2025-05-07T20:31:41.2388351Z         x1 = x[:, D:]
2025-05-07T20:31:41.2388562Z 
2025-05-07T20:31:41.2388760Z         if contiguous:
2025-05-07T20:31:41.2389006Z             x0 = x0.contiguous()
2025-05-07T20:31:41.2389275Z             x1 = x1.contiguous()
2025-05-07T20:31:41.2389532Z 
2025-05-07T20:31:41.2389735Z         if scale_ub is not None:
2025-05-07T20:31:41.2390026Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:41.2390367Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:41.2390689Z             )
2025-05-07T20:31:41.2390896Z         else:
2025-05-07T20:31:41.2391310Z             scale_ub_tensor = None
2025-05-07T20:31:41.2391573Z 
2025-05-07T20:31:41.2391816Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:41.2392136Z             op = silu_mul_quant
2025-05-07T20:31:41.2392406Z             if compiled:
2025-05-07T20:31:41.2392667Z                 op = torch.compile(op)
2025-05-07T20:31:41.2392972Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:41.2393266Z 
2025-05-07T20:31:41.2393471Z         y_fp8, y_scale = fn()
2025-05-07T20:31:41.2393766Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:41.2394077Z 
2025-05-07T20:31:41.2394328Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:41.2394679Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:41.2394980Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:41.2395307Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:41.2395677Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:41.2396003Z 
2025-05-07T20:31:41.2396220Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:41.2396419Z 
2025-05-07T20:31:41.2396531Z moe/activation_test.py:126: 
2025-05-07T20:31:41.2396836Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:41.2397185Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:41.2397523Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:41.2398320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:41.2399072Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:41.2399640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:41.2400332Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:41.2401124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:41.2401921Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:41.2402679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:41.2403431Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:41.2404154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:41.2404808Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:41.2405418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:41.2405945Z     fn()
2025-05-07T20:31:41.2406461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:41.2407049Z     self.fn.run(
2025-05-07T20:31:41.2407527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:41.2408060Z     kernel = self.compile(
2025-05-07T20:31:41.2408611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:41.2409271Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:41.2409676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:41.2409930Z 
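The frames above show ref_fn entering fbgemm_gpu's triton_quantize_fp8_row, whose autotuned Triton kernel is what ultimately rejects fp8e4nv. As rough orientation only, rowwise FP8 quantization with the dequant convention this test uses (y_fp8.to(torch.float32) * y_scale[:, None]) can be emulated in plain PyTorch along the lines below; the helper name, the 448.0 E4M3 max, and the clamping details are our assumptions, not FBGEMM's kernel:

    from typing import Optional, Tuple
    import torch

    def rowwise_quant_fp8_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absmax, kept away from zero and optionally capped by scale_ub.
        row_max = y.abs().amax(dim=1).to(torch.float32).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        scale = row_max / 448.0  # 448 is the largest normal float8_e4m3fn value
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize as y_fp8.to(torch.float32) * scale[:, None]

The cast to torch.float8_e4m3fn works on any device in PyTorch itself; it is only the Triton kernel's on-GPU use of fp8e4nv that this architecture refuses.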
2025-05-07T20:31:41.2417988Z self = <...>
2025-05-07T20:31:41.2419149Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:41.2422845Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f2fd90d25e0>}
2025-05-07T20:31:41.2424219Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:31:41.2425256Z context = <...>
2025-05-07T20:31:41.2425559Z 
2025-05-07T20:31:41.2425737Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:41.2426273Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:41.2426762Z                            module_map=module_map)
2025-05-07T20:31:41.2427137Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:41.2427512Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:41.2427812Z E       ^
2025-05-07T20:31:41.2428284Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:41.2428743Z 
2025-05-07T20:31:41.2429162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:41.2429684Z 
2025-05-07T20:31:41.2429792Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:41.2430219Z     self=<...>,
2025-05-07T20:31:41.2430625Z     T=4096,
2025-05-07T20:31:41.2430831Z     D=5120,
2025-05-07T20:31:41.2431036Z     scale_ub=None,
2025-05-07T20:31:41.2431259Z     contiguous=True,
2025-05-07T20:31:41.2431494Z     compiled=True,
2025-05-07T20:31:41.2431716Z )
[The identify_mutated_tensors warning and traceback shown above repeat verbatim four more times here, tagged [1/7] (20:31:41.766130, 20:31:41.958533, 20:31:42.470564, 20:31:42.510011).]
2025-05-07T20:31:42.9677522Z self = <...>
2025-05-07T20:31:42.9678320Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
[The @given decorators, test source listing, Triton traceback, and CompilationError are identical to the T=128 failure above.]
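Every failing example so far dies in the same place: Triton refuses to lower fp8e4nv (FP8 E4M3) on this GPU. The g5.4xlarge runner's A10G is compute capability (8, 6), and Triton's message says only fp8e4b15 and fp8e5 are available there; E4M3 support is generally assumed to start at (8, 9) (Ada) or (9, 0) (Hopper). A minimal guard a test suite could check first, sketched under those assumptions (the helper name is ours):

    import torch

    def supports_fp8e4nv() -> bool:
        # torch.float8_e4m3fn is what lowers to Triton's fp8e4nv dtype; NVIDIA
        # hardware support is assumed to begin at compute capability (8, 9).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

moe/activation_test.py could then skip via unittest.skipIf(not supports_fp8e4nv(), ...) instead of failing inside the kernel compile on every hypothesis example.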
2025-05-07T20:31:42.9717549Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:42.9717978Z     self=<...>,
2025-05-07T20:31:42.9718387Z     T=16384,
2025-05-07T20:31:42.9718595Z     D=5120,
2025-05-07T20:31:42.9718804Z     scale_ub=None,
2025-05-07T20:31:42.9719024Z     contiguous=True,
2025-05-07T20:31:42.9719259Z     compiled=True,
2025-05-07T20:31:42.9719480Z )
2025-05-07T20:31:43.0171488Z W0507 20:31:43.015486 86812 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:43.0173764Z W0507 20:31:43.015486 86812 site-packages/torch/_dynamo/convert_frame.py:987] [1/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:43.0176436Z W0507 20:31:43.015486 86812 site-packages/torch/_dynamo/convert_frame.py:987] [1/8]    last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:43.0178409Z W0507 20:31:43.015486 86812 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:43.0180617Z W0507 20:31:43.015486 86812 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
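The recompile-limit warning is distinct from the FP8 failures: each hypothesis example changes T or the contiguity of x0/x1, the compiled silu_mul_quant guards on strides (the "last reason" line shows a stride mismatch, expected 5120 vs. actual 10240, i.e. a contiguous copy vs. a view into the [T, 2*D] buffer), and after 8 recompiles dynamo falls back to eager. Two possible knobs, sketched on the assumption that this torch build exposes the recompile_limit config it names in the warning:

    import torch
    import torch._dynamo

    # Raise the guard-failure recompile budget (config.recompile_limit above).
    torch._dynamo.config.recompile_limit = 32

    # Or compile once with dynamic shapes so differing sizes and strides can
    # share a single graph instead of forcing a recompile per layout:
    # compiled_op = torch.compile(silu_mul_quant, dynamic=True)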
2025-05-07T20:31:43.1391978Z self = <...>
2025-05-07T20:31:43.1392799Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[The @given decorators, test source listing, Triton traceback, and CompilationError are identical to the T=128 failure above.]
2025-05-07T20:31:43.1432625Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:43.1433048Z     self=<...>,
2025-05-07T20:31:43.1433459Z     T=1,
2025-05-07T20:31:43.1433647Z     D=5120,
2025-05-07T20:31:43.1433851Z     scale_ub=1200.0,
2025-05-07T20:31:43.1434083Z     contiguous=True,
2025-05-07T20:31:43.1434309Z     compiled=True,
2025-05-07T20:31:43.1434525Z )
2025-05-07T20:31:43.5278667Z self = <...>
2025-05-07T20:31:43.5279422Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[The @given decorators and test body are identical to the T=128 failure above; this example fails one step earlier, inside fn() itself:]
2025-05-07T20:31:43.5300633Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:43.5300811Z 
2025-05-07T20:31:43.5300919Z moe/activation_test.py:117: 
2025-05-07T20:31:43.5301297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:43.5301639Z moe/activation_test.py:115: in fn
2025-05-07T20:31:43.5301946Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:43.5302526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:43.5303283Z     return fn(*args, **kwargs)
2025-05-07T20:31:43.5303964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.5304663Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.5305209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.5305898Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.5306566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.5307118Z kernel = self.compile( 2025-05-07T20:31:43.5307668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.5308333Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.5308755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5308996Z 2025-05-07T20:31:43.5309215Z self = 2025-05-07T20:31:43.5310303Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.5311673Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd82f5ca0>} 2025-05-07T20:31:43.5313018Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.5314038Z context = 2025-05-07T20:31:43.5314424Z 2025-05-07T20:31:43.5314605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.5315128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.5315608Z module_map=module_map) 2025-05-07T20:31:43.5315992Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.5316358Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.5316622Z E ^ 2025-05-07T20:31:43.5317096Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.5317553Z 2025-05-07T20:31:43.5317976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.5318487Z 2025-05-07T20:31:43.5318593Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.5319020Z self=, 2025-05-07T20:31:43.5319435Z T=1, 2025-05-07T20:31:43.5319628Z D=5120, 2025-05-07T20:31:43.5319821Z scale_ub=None, 2025-05-07T20:31:43.5320048Z contiguous=False, 2025-05-07T20:31:43.5320283Z compiled=True, 2025-05-07T20:31:43.5320493Z ) 2025-05-07T20:31:43.6133429Z self = 2025-05-07T20:31:43.6134893Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.6135644Z 2025-05-07T20:31:43.6135875Z @given( 2025-05-07T20:31:43.6136346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.6136995Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.6137632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.6138300Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.6138980Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.6139567Z ) 2025-05-07T20:31:43.6141047Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.6142051Z def test_silu_mul_quant( 2025-05-07T20:31:43.6142508Z self, 2025-05-07T20:31:43.6142718Z T: int, 2025-05-07T20:31:43.6142928Z D: int, 2025-05-07T20:31:43.6143162Z scale_ub: Optional[float], 2025-05-07T20:31:43.6143450Z contiguous: bool, 2025-05-07T20:31:43.6143697Z compiled: bool, 2025-05-07T20:31:43.6143936Z ) -> None: 2025-05-07T20:31:43.6144167Z torch.manual_seed(2025) 2025-05-07T20:31:43.6144418Z 2025-05-07T20:31:43.6144707Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.6145069Z 2025-05-07T20:31:43.6145267Z x_sign = torch.sign(x) 2025-05-07T20:31:43.6145576Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.6145904Z x = x_sign * x_clamp 2025-05-07T20:31:43.6146153Z x0 = x[:, :D] 2025-05-07T20:31:43.6146386Z x1 = x[:, D:] 2025-05-07T20:31:43.6146617Z 2025-05-07T20:31:43.6146817Z if contiguous: 2025-05-07T20:31:43.6147058Z x0 = x0.contiguous() 2025-05-07T20:31:43.6147334Z x1 = x1.contiguous() 2025-05-07T20:31:43.6147590Z 2025-05-07T20:31:43.6147786Z if scale_ub is not None: 2025-05-07T20:31:43.6148070Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.6148419Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.6148742Z ) 2025-05-07T20:31:43.6148948Z else: 2025-05-07T20:31:43.6149170Z scale_ub_tensor = None 2025-05-07T20:31:43.6149427Z 2025-05-07T20:31:43.6149672Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.6149998Z op = silu_mul_quant 2025-05-07T20:31:43.6150255Z if compiled: 2025-05-07T20:31:43.6150520Z op = torch.compile(op) 2025-05-07T20:31:43.6150833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6151268Z 2025-05-07T20:31:43.6151480Z y_fp8, y_scale = fn() 2025-05-07T20:31:43.6151780Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:43.6152121Z 2025-05-07T20:31:43.6152385Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.6152730Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:43.6153043Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:43.6153368Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:43.6153745Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:43.6154069Z 2025-05-07T20:31:43.6154277Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:43.6154485Z 2025-05-07T20:31:43.6154592Z moe/activation_test.py:126: 2025-05-07T20:31:43.6154909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6155260Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:43.6155600Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:43.6156592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:43.6157373Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:43.6157933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.6158621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.6159319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:43.6160051Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:43.6160811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:43.6161644Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:43.6162392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:43.6163040Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:43.6163645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:43.6164187Z fn() 2025-05-07T20:31:43.6164710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:43.6165292Z self.fn.run( 2025-05-07T20:31:43.6165759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.6166303Z kernel = self.compile( 2025-05-07T20:31:43.6166852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.6167509Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.6167927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6168168Z 2025-05-07T20:31:43.6168381Z self = 2025-05-07T20:31:43.6169466Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.6170850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f2fd838f280>} 2025-05-07T20:31:43.6172185Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.6173320Z context = 2025-05-07T20:31:43.6173620Z 2025-05-07T20:31:43.6173792Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.6174327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.6174795Z module_map=module_map) 2025-05-07T20:31:43.6175174Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.6175542Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:43.6175812Z E ^ 2025-05-07T20:31:43.6176281Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.6176740Z 2025-05-07T20:31:43.6177163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.6177675Z 2025-05-07T20:31:43.6177788Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.6178216Z self=, 2025-05-07T20:31:43.6178626Z T=1, 2025-05-07T20:31:43.6178821Z D=5120, 2025-05-07T20:31:43.6179016Z scale_ub=None, 2025-05-07T20:31:43.6179240Z contiguous=True, 2025-05-07T20:31:43.6179481Z compiled=False, 2025-05-07T20:31:43.6179690Z ) 2025-05-07T20:31:43.8158882Z self = 2025-05-07T20:31:43.8159738Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:43.8160123Z 2025-05-07T20:31:43.8160238Z @given( 2025-05-07T20:31:43.8160516Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.8160837Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.8161155Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.8161499Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.8162114Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.8162442Z ) 2025-05-07T20:31:43.8162805Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.8163264Z def test_silu_mul_quant( 2025-05-07T20:31:43.8163511Z self, 2025-05-07T20:31:43.8163713Z T: int, 2025-05-07T20:31:43.8163918Z D: int, 2025-05-07T20:31:43.8164140Z scale_ub: Optional[float], 2025-05-07T20:31:43.8164422Z contiguous: bool, 2025-05-07T20:31:43.8164678Z compiled: bool, 2025-05-07T20:31:43.8164910Z ) -> None: 2025-05-07T20:31:43.8165145Z torch.manual_seed(2025) 2025-05-07T20:31:43.8165396Z 2025-05-07T20:31:43.8165672Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.8166028Z 2025-05-07T20:31:43.8166232Z x_sign = torch.sign(x) 2025-05-07T20:31:43.8166531Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.8166857Z x = x_sign * x_clamp 2025-05-07T20:31:43.8167122Z x0 = x[:, :D] 2025-05-07T20:31:43.8167347Z x1 = x[:, D:] 2025-05-07T20:31:43.8167569Z 2025-05-07T20:31:43.8167769Z if contiguous: 2025-05-07T20:31:43.8168017Z x0 = x0.contiguous() 2025-05-07T20:31:43.8168284Z x1 = x1.contiguous() 2025-05-07T20:31:43.8168541Z 2025-05-07T20:31:43.8168744Z if scale_ub is not None: 2025-05-07T20:31:43.8169023Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.8169367Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.8169690Z ) 2025-05-07T20:31:43.8169888Z else: 2025-05-07T20:31:43.8170109Z scale_ub_tensor = None 2025-05-07T20:31:43.8170371Z 2025-05-07T20:31:43.8170610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.8170939Z op = silu_mul_quant 2025-05-07T20:31:43.8171202Z if compiled: 2025-05-07T20:31:43.8171454Z op 
= torch.compile(op) 2025-05-07T20:31:43.8171963Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8172247Z 2025-05-07T20:31:43.8172444Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.8172622Z 2025-05-07T20:31:43.8172727Z moe/activation_test.py:117: 2025-05-07T20:31:43.8173039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8173383Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.8173669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8174366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.8175064Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.8175601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.8176295Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.8176971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.8177518Z kernel = self.compile( 2025-05-07T20:31:43.8178058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.8178722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.8179136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8179373Z 2025-05-07T20:31:43.8179590Z self = 2025-05-07T20:31:43.8180666Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.8182318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd7d56040>} 2025-05-07T20:31:43.8183671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.8184693Z context = 2025-05-07T20:31:43.8184984Z 2025-05-07T20:31:43.8185156Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.8185691Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.8186164Z module_map=module_map) 2025-05-07T20:31:43.8186548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.8186906Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.8187181Z E ^ 2025-05-07T20:31:43.8187658Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.8188113Z 2025-05-07T20:31:43.8188537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.8189059Z 2025-05-07T20:31:43.8189167Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.8189590Z self=, 2025-05-07T20:31:43.8190000Z T=128, 2025-05-07T20:31:43.8190192Z D=5120, 2025-05-07T20:31:43.8190396Z scale_ub=None, 2025-05-07T20:31:43.8190623Z contiguous=False, 2025-05-07T20:31:43.8190855Z compiled=True, 2025-05-07T20:31:43.8191073Z ) 2025-05-07T20:31:43.8191404Z self = 2025-05-07T20:31:43.8191899Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.8192179Z 2025-05-07T20:31:43.8192259Z @given( 2025-05-07T20:31:43.8192588Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.8192916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.8193226Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.8193567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.8193906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.8194194Z ) 2025-05-07T20:31:43.8194551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.8195004Z def test_silu_mul_quant( 2025-05-07T20:31:43.8195247Z self, 2025-05-07T20:31:43.8195456Z T: int, 2025-05-07T20:31:43.8195665Z D: int, 2025-05-07T20:31:43.8195885Z scale_ub: Optional[float], 2025-05-07T20:31:43.8196168Z contiguous: bool, 2025-05-07T20:31:43.8196415Z compiled: bool, 2025-05-07T20:31:43.8196641Z ) -> None: 2025-05-07T20:31:43.8196866Z torch.manual_seed(2025) 2025-05-07T20:31:43.8197117Z 2025-05-07T20:31:43.8197409Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.8197759Z 2025-05-07T20:31:43.8197964Z x_sign = torch.sign(x) 2025-05-07T20:31:43.8198265Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.8198579Z x = x_sign * x_clamp 2025-05-07T20:31:43.8198830Z x0 = x[:, :D] 2025-05-07T20:31:43.8199057Z x1 = x[:, D:] 2025-05-07T20:31:43.8199269Z 2025-05-07T20:31:43.8199469Z if contiguous: 2025-05-07T20:31:43.8199713Z x0 = x0.contiguous() 2025-05-07T20:31:43.8199978Z x1 = x1.contiguous() 2025-05-07T20:31:43.8200234Z 2025-05-07T20:31:43.8200438Z if scale_ub is not None: 2025-05-07T20:31:43.8200716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.8201064Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.8201387Z ) 2025-05-07T20:31:43.8201583Z else: 2025-05-07T20:31:43.8201886Z scale_ub_tensor = None 2025-05-07T20:31:43.8202158Z 2025-05-07T20:31:43.8202401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.8202719Z op = silu_mul_quant 2025-05-07T20:31:43.8202983Z if compiled: 2025-05-07T20:31:43.8203242Z op = torch.compile(op) 2025-05-07T20:31:43.8203550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8203839Z 2025-05-07T20:31:43.8204042Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.8204210Z 2025-05-07T20:31:43.8204320Z moe/activation_test.py:117: 2025-05-07T20:31:43.8204625Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8204966Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.8205250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8205819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.8206389Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.8207060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.8207747Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.8208289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.8208974Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.8209656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.8210190Z kernel = self.compile( 2025-05-07T20:31:43.8210739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.8211404Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.8211804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8212139Z 2025-05-07T20:31:43.8212351Z self = 2025-05-07T20:31:43.8213432Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.8214804Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd912ac10>} 2025-05-07T20:31:43.8216152Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.8217177Z context = 2025-05-07T20:31:43.8217470Z 2025-05-07T20:31:43.8217644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.8218182Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.8218663Z module_map=module_map) 2025-05-07T20:31:43.8219044Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.8219401Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.8219674Z E ^ 2025-05-07T20:31:43.8220141Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.8220589Z 2025-05-07T20:31:43.8221008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.8221606Z 2025-05-07T20:31:43.8221713Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.8222140Z self=, 2025-05-07T20:31:43.8222557Z T=128, 2025-05-07T20:31:43.8222832Z D=7168, 2025-05-07T20:31:43.8223038Z scale_ub=1200.0, 2025-05-07T20:31:43.8223275Z contiguous=False, 2025-05-07T20:31:43.8223504Z compiled=False, 2025-05-07T20:31:43.8223722Z ) 2025-05-07T20:31:43.9761503Z self = 2025-05-07T20:31:43.9762317Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:43.9762954Z 2025-05-07T20:31:43.9763176Z @given( 2025-05-07T20:31:43.9763828Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.9764520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.9765157Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.9765837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.9766497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.9767080Z ) 2025-05-07T20:31:43.9767822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.9768722Z def test_silu_mul_quant( 2025-05-07T20:31:43.9769223Z self, 2025-05-07T20:31:43.9769626Z T: int, 2025-05-07T20:31:43.9770027Z D: int, 2025-05-07T20:31:43.9770479Z scale_ub: Optional[float], 2025-05-07T20:31:43.9771036Z contiguous: bool, 2025-05-07T20:31:43.9771516Z compiled: bool, 2025-05-07T20:31:43.9771980Z ) -> None: 2025-05-07T20:31:43.9772284Z torch.manual_seed(2025) 2025-05-07T20:31:43.9772539Z 2025-05-07T20:31:43.9772819Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.9773173Z 2025-05-07T20:31:43.9773378Z x_sign = torch.sign(x) 2025-05-07T20:31:43.9773676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.9774001Z x = x_sign * x_clamp 2025-05-07T20:31:43.9774255Z x0 = x[:, :D] 2025-05-07T20:31:43.9774483Z x1 = x[:, D:] 2025-05-07T20:31:43.9774706Z 2025-05-07T20:31:43.9775742Z if contiguous: 2025-05-07T20:31:43.9775983Z x0 = x0.contiguous() 2025-05-07T20:31:43.9776259Z x1 = x1.contiguous() 2025-05-07T20:31:43.9776513Z 2025-05-07T20:31:43.9776712Z if scale_ub is not None: 2025-05-07T20:31:43.9777000Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.9777356Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.9777671Z ) 2025-05-07T20:31:43.9777880Z else: 2025-05-07T20:31:43.9778104Z scale_ub_tensor = None 2025-05-07T20:31:43.9778369Z 2025-05-07T20:31:43.9778610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.9778940Z op = silu_mul_quant 2025-05-07T20:31:43.9779211Z if compiled: 2025-05-07T20:31:43.9779470Z op = torch.compile(op) 2025-05-07T20:31:43.9779778Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9780068Z 2025-05-07T20:31:43.9780269Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.9780455Z 2025-05-07T20:31:43.9780563Z moe/activation_test.py:117: 2025-05-07T20:31:43.9780872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9781346Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.9781638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9782339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.9783086Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.9783631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.9784323Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.9784997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.9785534Z kernel = self.compile( 2025-05-07T20:31:43.9795062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.9795766Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.9796181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9796430Z 2025-05-07T20:31:43.9796649Z self = 2025-05-07T20:31:43.9797740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.9799148Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd8467dc0>} 2025-05-07T20:31:43.9800506Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.9801550Z context = 2025-05-07T20:31:43.9801854Z 2025-05-07T20:31:43.9802030Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.9802570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.9803055Z module_map=module_map) 2025-05-07T20:31:43.9803443Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.9803814Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.9804091Z E ^ 2025-05-07T20:31:43.9804567Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.9805029Z 2025-05-07T20:31:43.9805462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.9806131Z 2025-05-07T20:31:43.9806255Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.9806686Z self=, 2025-05-07T20:31:43.9807093Z T=128, 2025-05-07T20:31:43.9807298Z D=5120, 2025-05-07T20:31:43.9807508Z scale_ub=None, 2025-05-07T20:31:43.9807731Z contiguous=False, 2025-05-07T20:31:43.9807973Z compiled=False, 2025-05-07T20:31:43.9808196Z ) 2025-05-07T20:31:43.9808520Z self = 2025-05-07T20:31:43.9809030Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:43.9809310Z 2025-05-07T20:31:43.9809402Z @given( 2025-05-07T20:31:43.9809645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.9809981Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.9810327Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.9810683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.9811026Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.9811329Z ) 2025-05-07T20:31:43.9811694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.9812151Z def test_silu_mul_quant( 2025-05-07T20:31:43.9812415Z self, 2025-05-07T20:31:43.9812629Z T: int, 2025-05-07T20:31:43.9812858Z D: int, 2025-05-07T20:31:43.9813117Z scale_ub: Optional[float], 2025-05-07T20:31:43.9813409Z contiguous: bool, 2025-05-07T20:31:43.9813658Z compiled: bool, 2025-05-07T20:31:43.9813903Z ) -> None: 2025-05-07T20:31:43.9814135Z torch.manual_seed(2025) 2025-05-07T20:31:43.9814388Z 2025-05-07T20:31:43.9814675Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.9815034Z 2025-05-07T20:31:43.9815328Z x_sign = torch.sign(x) 2025-05-07T20:31:43.9815642Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.9815972Z x = x_sign * x_clamp 2025-05-07T20:31:43.9816234Z x0 = x[:, :D] 2025-05-07T20:31:43.9816458Z x1 = x[:, D:] 2025-05-07T20:31:43.9816673Z 2025-05-07T20:31:43.9816862Z if contiguous: 2025-05-07T20:31:43.9817103Z x0 = x0.contiguous() 2025-05-07T20:31:43.9817376Z x1 = x1.contiguous() 2025-05-07T20:31:43.9817636Z 2025-05-07T20:31:43.9817839Z if scale_ub is not None: 2025-05-07T20:31:43.9818136Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.9818491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.9818811Z ) 2025-05-07T20:31:43.9819026Z else: 2025-05-07T20:31:43.9819257Z scale_ub_tensor = None 2025-05-07T20:31:43.9819519Z 2025-05-07T20:31:43.9819765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.9820114Z op = silu_mul_quant 2025-05-07T20:31:43.9820381Z if compiled: 2025-05-07T20:31:43.9820652Z op = torch.compile(op) 2025-05-07T20:31:43.9820967Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9821329Z 2025-05-07T20:31:43.9821531Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.9821715Z 2025-05-07T20:31:43.9821821Z moe/activation_test.py:117: 2025-05-07T20:31:43.9822139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9822482Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.9822809Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9823542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.9824238Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.9824799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.9825583Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.9826264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.9826817Z kernel = self.compile( 2025-05-07T20:31:43.9827376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.9828067Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.9828482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9828717Z 2025-05-07T20:31:43.9828929Z self = 2025-05-07T20:31:43.9830020Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.9831411Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd7d7b3a0>} 2025-05-07T20:31:43.9832759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.9833777Z context = 2025-05-07T20:31:43.9834077Z 2025-05-07T20:31:43.9834249Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.9834783Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.9835266Z module_map=module_map) 2025-05-07T20:31:43.9835642Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.9836092Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.9836369Z E ^ 2025-05-07T20:31:43.9836834Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.9837289Z 2025-05-07T20:31:43.9837722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.9838257Z 2025-05-07T20:31:43.9838364Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.9838796Z self=, 2025-05-07T20:31:43.9839203Z T=128, 2025-05-07T20:31:43.9839410Z D=5120, 2025-05-07T20:31:43.9839620Z scale_ub=1200.0, 2025-05-07T20:31:43.9839862Z contiguous=True, 2025-05-07T20:31:43.9840360Z compiled=False, 2025-05-07T20:31:43.9840678Z ) 2025-05-07T20:31:44.2109247Z self = 2025-05-07T20:31:44.2110753Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.2111486Z 2025-05-07T20:31:44.2111657Z @given( 2025-05-07T20:31:44.2112131Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.2112513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.2112852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.2113197Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.2113539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.2113836Z ) 2025-05-07T20:31:44.2114192Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.2114848Z def test_silu_mul_quant( 2025-05-07T20:31:44.2115104Z self, 2025-05-07T20:31:44.2115313Z T: int, 2025-05-07T20:31:44.2115517Z D: int, 2025-05-07T20:31:44.2115747Z scale_ub: Optional[float], 2025-05-07T20:31:44.2116034Z contiguous: bool, 2025-05-07T20:31:44.2116651Z compiled: bool, 2025-05-07T20:31:44.2116923Z ) -> None: 2025-05-07T20:31:44.2117156Z torch.manual_seed(2025) 2025-05-07T20:31:44.2117414Z 2025-05-07T20:31:44.2117695Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.2118055Z 2025-05-07T20:31:44.2118260Z x_sign = torch.sign(x) 2025-05-07T20:31:44.2118561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.2118886Z x = x_sign * x_clamp 2025-05-07T20:31:44.2119140Z x0 = x[:, :D] 2025-05-07T20:31:44.2119371Z x1 = x[:, D:] 2025-05-07T20:31:44.2119584Z 2025-05-07T20:31:44.2119782Z if contiguous: 2025-05-07T20:31:44.2120030Z x0 = x0.contiguous() 2025-05-07T20:31:44.2120298Z x1 = x1.contiguous() 2025-05-07T20:31:44.2120552Z 2025-05-07T20:31:44.2120755Z if scale_ub is not None: 2025-05-07T20:31:44.2121038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.2121405Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.2121730Z ) 2025-05-07T20:31:44.2121931Z else: 2025-05-07T20:31:44.2122155Z scale_ub_tensor = None 2025-05-07T20:31:44.2122421Z 2025-05-07T20:31:44.2122660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.2122989Z op = silu_mul_quant 2025-05-07T20:31:44.2123254Z if compiled: 2025-05-07T20:31:44.2123515Z op = torch.compile(op) 2025-05-07T20:31:44.2123820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.2124116Z 2025-05-07T20:31:44.2124325Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.2124500Z 2025-05-07T20:31:44.2124607Z moe/activation_test.py:117: 2025-05-07T20:31:44.2124916Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.2125266Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.2125560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.2126463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.2127182Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.2127738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.2128426Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.2129099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.2129644Z kernel = self.compile( 2025-05-07T20:31:44.2130191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.2130861Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.2131270Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.2131518Z 2025-05-07T20:31:44.2131736Z self = 2025-05-07T20:31:44.2132813Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.2134198Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd8303c10>} 2025-05-07T20:31:44.2135543Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.2136565Z context = 2025-05-07T20:31:44.2136856Z 2025-05-07T20:31:44.2137040Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.2137659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.2138135Z module_map=module_map) 2025-05-07T20:31:44.2138510Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.2138868Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.2139141Z E ^ 2025-05-07T20:31:44.2139607Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.2140351Z 2025-05-07T20:31:44.2140788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.2141406Z 2025-05-07T20:31:44.2141513Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.2141938Z self=, 2025-05-07T20:31:44.2142361Z T=1, 2025-05-07T20:31:44.2142555Z D=7168, 2025-05-07T20:31:44.2142759Z scale_ub=1200.0, 2025-05-07T20:31:44.2142994Z contiguous=True, 2025-05-07T20:31:44.2143222Z compiled=True, 2025-05-07T20:31:44.2143443Z ) 2025-05-07T20:31:44.2143772Z self = 2025-05-07T20:31:44.2144271Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.2144534Z 2025-05-07T20:31:44.2144616Z @given( 2025-05-07T20:31:44.2144855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.2145182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.2145492Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.2145840Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.2146180Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.2146469Z ) 2025-05-07T20:31:44.2146833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.2147425Z def test_silu_mul_quant( 2025-05-07T20:31:44.2147683Z self, 2025-05-07T20:31:44.2147884Z T: int, 2025-05-07T20:31:44.2148092Z D: int, 2025-05-07T20:31:44.2148323Z scale_ub: Optional[float], 2025-05-07T20:31:44.2148603Z contiguous: bool, 2025-05-07T20:31:44.2148854Z compiled: bool, 2025-05-07T20:31:44.2149085Z ) -> None: 2025-05-07T20:31:44.2149304Z torch.manual_seed(2025) 2025-05-07T20:31:44.2149558Z 2025-05-07T20:31:44.2149841Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.2150190Z 2025-05-07T20:31:44.2150392Z x_sign = torch.sign(x) 2025-05-07T20:31:44.2150691Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.2151005Z x = x_sign * x_clamp 2025-05-07T20:31:44.2151255Z x0 = x[:, :D] 2025-05-07T20:31:44.2151480Z x1 = x[:, D:] 2025-05-07T20:31:44.2151690Z 2025-05-07T20:31:44.2151897Z if contiguous: 2025-05-07T20:31:44.2152137Z x0 = x0.contiguous() 2025-05-07T20:31:44.2152399Z x1 = x1.contiguous() 2025-05-07T20:31:44.2152652Z 2025-05-07T20:31:44.2152852Z if scale_ub is not None: 2025-05-07T20:31:44.2153138Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.2153479Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.2153795Z ) 2025-05-07T20:31:44.2153998Z else: 2025-05-07T20:31:44.2154211Z scale_ub_tensor = None 2025-05-07T20:31:44.2154473Z 2025-05-07T20:31:44.2154717Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.2155037Z op = silu_mul_quant 2025-05-07T20:31:44.2155299Z if compiled: 2025-05-07T20:31:44.2155556Z op = torch.compile(op) 2025-05-07T20:31:44.2155856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.2156140Z 2025-05-07T20:31:44.2156345Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.2156658Z 2025-05-07T20:31:44.2156761Z moe/activation_test.py:117: 2025-05-07T20:31:44.2157064Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.2157408Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.2157703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.2158266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.2158843Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.2159512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.2160202Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.2160755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.2161443Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.2162123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.2162658Z kernel = self.compile( 2025-05-07T20:31:44.2163211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.2163875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.2164283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.2164520Z 2025-05-07T20:31:44.2164731Z self = 2025-05-07T20:31:44.2165830Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.2167277Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd83601f0>} 2025-05-07T20:31:44.2168626Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.2169641Z context = 2025-05-07T20:31:44.2169943Z 2025-05-07T20:31:44.2170113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.2170652Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.2171128Z module_map=module_map) 2025-05-07T20:31:44.2171498Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.2171865Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.2172135Z E ^ 2025-05-07T20:31:44.2172606Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.2173122Z 2025-05-07T20:31:44.2173547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.2174068Z 2025-05-07T20:31:44.2174175Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.2174600Z self=, 2025-05-07T20:31:44.2175003Z T=1, 2025-05-07T20:31:44.2175194Z D=7168, 2025-05-07T20:31:44.2175400Z scale_ub=1200.0, 2025-05-07T20:31:44.2175630Z contiguous=False, 2025-05-07T20:31:44.2175864Z compiled=True, 2025-05-07T20:31:44.2176076Z ) 2025-05-07T20:31:44.5953707Z self = 2025-05-07T20:31:44.5954461Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:44.5954846Z 2025-05-07T20:31:44.5955421Z @given( 2025-05-07T20:31:44.5955748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5956167Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5956520Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5956855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5957197Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5957493Z ) 2025-05-07T20:31:44.5957850Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5958292Z def test_silu_mul_quant( 2025-05-07T20:31:44.5958541Z self, 2025-05-07T20:31:44.5958747Z T: int, 2025-05-07T20:31:44.5958946Z D: int, 2025-05-07T20:31:44.5959176Z scale_ub: Optional[float], 2025-05-07T20:31:44.5959456Z contiguous: bool, 2025-05-07T20:31:44.5959697Z compiled: bool, 2025-05-07T20:31:44.5959934Z ) -> None: 2025-05-07T20:31:44.5960161Z torch.manual_seed(2025) 2025-05-07T20:31:44.5960425Z 2025-05-07T20:31:44.5960702Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5961061Z 2025-05-07T20:31:44.5961265Z x_sign = torch.sign(x) 2025-05-07T20:31:44.5961564Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.5961877Z x = x_sign * x_clamp 2025-05-07T20:31:44.5962128Z x0 = x[:, :D] 2025-05-07T20:31:44.5962353Z x1 = x[:, D:] 2025-05-07T20:31:44.5962564Z 2025-05-07T20:31:44.5962789Z if contiguous: 2025-05-07T20:31:44.5963053Z x0 = x0.contiguous() 2025-05-07T20:31:44.5963315Z x1 = x1.contiguous() 2025-05-07T20:31:44.5963566Z 2025-05-07T20:31:44.5963767Z if scale_ub is not None: 2025-05-07T20:31:44.5964044Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.5964389Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.5964708Z ) 2025-05-07T20:31:44.5964902Z else: 2025-05-07T20:31:44.5965281Z scale_ub_tensor = None 2025-05-07T20:31:44.5965543Z 2025-05-07T20:31:44.5965776Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.5966101Z op = silu_mul_quant 2025-05-07T20:31:44.5966360Z if compiled: 2025-05-07T20:31:44.5966610Z op = torch.compile(op) 2025-05-07T20:31:44.5966915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5967200Z 2025-05-07T20:31:44.5967400Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.5967568Z 2025-05-07T20:31:44.5967672Z moe/activation_test.py:117: 2025-05-07T20:31:44.5967973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5968312Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.5968596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5969164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.5969745Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f2fd8f718b0>}
module_map = {'triton.language.extra.libdevice': <module 'triton.language.extra.cuda.libdevice'>}
context = <triton._C.libtriton.ir.context object>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f2fd7721430>}
module_map = {'triton.language.extra.libdevice': <module 'triton.language.extra.cuda.libdevice'>}
context = <triton._C.libtriton.ir.context object>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
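Note on the failure: both the FBGEMM kernel (_fbgemm_silu_mul_quant) and the reference quantization kernel (_kernel_quantize_fp8_row) die at the same point, when Triton tries to lower the fp8e4nv element type (torch.float8_e4m3fn). fp8e4nv requires an NVIDIA GPU with compute capability 8.9 or newer (Ada/Hopper); the error listing only ('fp8e4b15', 'fp8e5') as supported indicates an older part, consistent with an A10G (compute capability 8.6) on an AWS g5 instance. A minimal sketch of the kind of guard that would skip these cases on unsupported hardware follows; the helper name is hypothetical and not part of activation_test.py:

    import unittest
    import torch

    def _cuda_supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton lowers torch.float8_e4m3fn to 'fp8e4nv',
        # which needs compute capability >= 8.9 (Ada / Hopper). An A10G
        # reports (8, 6), so this returns False on that hardware.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test class:
    @unittest.skipIf(not _cuda_supports_fp8e4nv(), "FP8 e4m3 requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...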
Hypothesis then tried further examples, and every one raised the identical CompilationError from the same _fbgemm_silu_mul_quant compile (the per-example test source and tracebacks match the first example above):

Trying example: test_silu_mul_quant(self=<...>, T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)

Each fails at moe/activation_test.py:117 in "y_fp8, y_scale = fn()"; when compiled=True the call additionally passes through /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678 before reaching activation.py:80, but the Triton error is unchanged (a dtype-fallback sketch follows below).
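That the same ValueError appears for every T, D, scale_ub, contiguous, and compiled combination confirms this is purely an architecture/dtype mismatch, not a shape or autotuning issue. Where running on pre-SM-8.9 GPUs is a requirement, one workaround is to fall back to an fp8 variant the backend does support; a sketch under that assumption (the helper name is hypothetical, not an FBGEMM API):

    import torch

    def _pick_fp8_dtype() -> torch.dtype:
        # Hypothetical fallback: torch.float8_e4m3fn maps to Triton's
        # 'fp8e4nv' (SM 8.9+ only), while torch.float8_e5m2 maps to 'fp8e5',
        # which the error message lists as supported on this GPU.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2

Note the trade-off: e5m2 gives more exponent range but less mantissa precision than e4m3, so test tolerances would likely need loosening.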
2025-05-07T20:31:45.4404229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4404918Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4405456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4406139Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4406808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4407346Z kernel = self.compile( 2025-05-07T20:31:45.4407892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4408556Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4408961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4409194Z 2025-05-07T20:31:45.4409404Z self = 2025-05-07T20:31:45.4410482Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4411856Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd764b790>} 2025-05-07T20:31:45.4413272Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4414314Z context = 2025-05-07T20:31:45.4414603Z 2025-05-07T20:31:45.4414774Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4415304Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4415774Z module_map=module_map) 2025-05-07T20:31:45.4416148Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4416501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4416770Z E ^ 2025-05-07T20:31:45.4417242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4417690Z 2025-05-07T20:31:45.4418109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4418635Z 2025-05-07T20:31:45.4418743Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4419166Z self=, 2025-05-07T20:31:45.4419571Z T=4096, 2025-05-07T20:31:45.4419759Z D=5120, 2025-05-07T20:31:45.4419961Z scale_ub=None, 2025-05-07T20:31:45.4420183Z contiguous=False, 2025-05-07T20:31:45.4420410Z compiled=True, 2025-05-07T20:31:45.4420624Z ) 2025-05-07T20:31:45.4420952Z self = 2025-05-07T20:31:45.4421531Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4421815Z 2025-05-07T20:31:45.4421901Z @given( 2025-05-07T20:31:45.4422133Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4422460Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4422775Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4423211Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4423550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4423843Z ) 2025-05-07T20:31:45.4424199Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4424651Z def test_silu_mul_quant( 2025-05-07T20:31:45.4424900Z self, 2025-05-07T20:31:45.4425096Z T: int, 2025-05-07T20:31:45.4425305Z D: int, 2025-05-07T20:31:45.4425533Z scale_ub: Optional[float], 2025-05-07T20:31:45.4425820Z contiguous: bool, 2025-05-07T20:31:45.4426063Z compiled: bool, 2025-05-07T20:31:45.4426295Z ) -> None: 2025-05-07T20:31:45.4426523Z torch.manual_seed(2025) 2025-05-07T20:31:45.4426769Z 2025-05-07T20:31:45.4427053Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4427403Z 2025-05-07T20:31:45.4427598Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4427899Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4428229Z x = x_sign * x_clamp 2025-05-07T20:31:45.4428473Z x0 = x[:, :D] 2025-05-07T20:31:45.4428696Z x1 = x[:, D:] 2025-05-07T20:31:45.4428913Z 2025-05-07T20:31:45.4429099Z if contiguous: 2025-05-07T20:31:45.4429338Z x0 = x0.contiguous() 2025-05-07T20:31:45.4429605Z x1 = x1.contiguous() 2025-05-07T20:31:45.4429848Z 2025-05-07T20:31:45.4430045Z if scale_ub is not None: 2025-05-07T20:31:45.4430328Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4430666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4430983Z ) 2025-05-07T20:31:45.4431185Z else: 2025-05-07T20:31:45.4431404Z scale_ub_tensor = None 2025-05-07T20:31:45.4431657Z 2025-05-07T20:31:45.4431895Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4432214Z op = silu_mul_quant 2025-05-07T20:31:45.4432558Z if compiled: 2025-05-07T20:31:45.4432860Z op = torch.compile(op) 2025-05-07T20:31:45.4433174Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4433453Z 2025-05-07T20:31:45.4433654Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4433823Z 2025-05-07T20:31:45.4433933Z moe/activation_test.py:117: 2025-05-07T20:31:45.4434230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4434570Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4434863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4435429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4435988Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4436658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4437351Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4437904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4438603Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4439273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4439810Z kernel = self.compile( 2025-05-07T20:31:45.4440648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4441323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4441731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4441966Z 2025-05-07T20:31:45.4442187Z self = 2025-05-07T20:31:45.4443276Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4444785Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd72e0550>} 2025-05-07T20:31:45.4446125Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4447143Z context = 2025-05-07T20:31:45.4447433Z 2025-05-07T20:31:45.4447605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4448135Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4448608Z module_map=module_map) 2025-05-07T20:31:45.4448985Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4449339Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4449605Z E ^ 2025-05-07T20:31:45.4450077Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4450525Z 2025-05-07T20:31:45.4450941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4451468Z 2025-05-07T20:31:45.8512218Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.8512929Z self=, 2025-05-07T20:31:45.8513490Z T=4096, 2025-05-07T20:31:45.8513754Z D=5120, 2025-05-07T20:31:45.8514016Z scale_ub=1200.0, 2025-05-07T20:31:45.8514310Z contiguous=False, 2025-05-07T20:31:45.8514598Z compiled=False, 2025-05-07T20:31:45.8514821Z ) 2025-05-07T20:31:45.8515517Z self = 2025-05-07T20:31:45.8516050Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.8516330Z 2025-05-07T20:31:45.8516423Z @given( 2025-05-07T20:31:45.8516662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.8516994Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.8517317Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.8517666Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.8518009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.8518313Z ) 2025-05-07T20:31:45.8518685Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.8519142Z def test_silu_mul_quant( 2025-05-07T20:31:45.8519399Z self, 2025-05-07T20:31:45.8519605Z T: int, 2025-05-07T20:31:45.8519809Z D: int, 2025-05-07T20:31:45.8520052Z scale_ub: Optional[float], 2025-05-07T20:31:45.8520335Z contiguous: bool, 2025-05-07T20:31:45.8520586Z compiled: bool, 2025-05-07T20:31:45.8520830Z ) -> None: 2025-05-07T20:31:45.8521062Z torch.manual_seed(2025) 2025-05-07T20:31:45.8521315Z 2025-05-07T20:31:45.8521601Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.8521964Z 2025-05-07T20:31:45.8522161Z x_sign = torch.sign(x) 2025-05-07T20:31:45.8522466Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.8522796Z x = x_sign * x_clamp 2025-05-07T20:31:45.8523050Z x0 = x[:, :D] 2025-05-07T20:31:45.8523272Z x1 = x[:, D:] 2025-05-07T20:31:45.8523494Z 2025-05-07T20:31:45.8523691Z if contiguous: 2025-05-07T20:31:45.8523935Z x0 = x0.contiguous() 2025-05-07T20:31:45.8524207Z x1 = x1.contiguous() 2025-05-07T20:31:45.8524458Z 2025-05-07T20:31:45.8524648Z if scale_ub is not None: 2025-05-07T20:31:45.8525100Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.8525447Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.8525761Z ) 2025-05-07T20:31:45.8525965Z else: 2025-05-07T20:31:45.8526188Z scale_ub_tensor = None 2025-05-07T20:31:45.8526444Z 2025-05-07T20:31:45.8526689Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.8527022Z op = silu_mul_quant 2025-05-07T20:31:45.8527279Z if compiled: 2025-05-07T20:31:45.8527540Z op = torch.compile(op) 2025-05-07T20:31:45.8527851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.8528139Z 2025-05-07T20:31:45.8528337Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.8528516Z 2025-05-07T20:31:45.8528621Z moe/activation_test.py:117: 2025-05-07T20:31:45.8528936Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.8529287Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.8529587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.8530298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.8530995Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.8531575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.8532412Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.8533097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.8533634Z kernel = self.compile( 2025-05-07T20:31:45.8534189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.8534859Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.8535408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.8535647Z 2025-05-07T20:31:45.8535860Z self = 2025-05-07T20:31:45.8536948Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.8538402Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd72a40d0>} 2025-05-07T20:31:45.8539755Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.8541075Z context = 2025-05-07T20:31:45.8541441Z 2025-05-07T20:31:45.8541619Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.8542167Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.8542645Z module_map=module_map) 2025-05-07T20:31:45.8543035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.8543396Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.8543669Z E ^ 2025-05-07T20:31:45.8544147Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:45.8545670Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:45.8546241Z     self=,
2025-05-07T20:31:45.8546656Z     T=4096,
2025-05-07T20:31:45.8546851Z     D=5120,
2025-05-07T20:31:45.8547055Z     scale_ub=1200.0,
2025-05-07T20:31:45.8547291Z     contiguous=False,
2025-05-07T20:31:45.8547527Z     compiled=True,
2025-05-07T20:31:45.8547745Z )
2025-05-07T20:31:45.8548077Z self = 
2025-05-07T20:31:45.8548579Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:31:45.8548868Z 
2025-05-07T20:31:45.8548948Z     @given(
2025-05-07T20:31:45.8549188Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:45.8549504Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:45.8549826Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:45.8550168Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:45.8550513Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:45.8550816Z     )
2025-05-07T20:31:45.8551178Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:45.8551639Z     def test_silu_mul_quant(
2025-05-07T20:31:45.8551892Z         self,
2025-05-07T20:31:45.8552101Z         T: int,
2025-05-07T20:31:45.8552316Z         D: int,
2025-05-07T20:31:45.8552544Z         scale_ub: Optional[float],
2025-05-07T20:31:45.8552832Z         contiguous: bool,
2025-05-07T20:31:45.8553086Z         compiled: bool,
2025-05-07T20:31:45.8553316Z     ) -> None:
2025-05-07T20:31:45.8553548Z         torch.manual_seed(2025)
2025-05-07T20:31:45.8553806Z 
2025-05-07T20:31:45.8554087Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.8554444Z 
2025-05-07T20:31:45.8554647Z         x_sign = torch.sign(x)
2025-05-07T20:31:45.8554953Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.8555274Z         x = x_sign * x_clamp
2025-05-07T20:31:45.8555529Z         x0 = x[:, :D]
2025-05-07T20:31:45.8555881Z         x1 = x[:, D:]
2025-05-07T20:31:45.8556106Z 
2025-05-07T20:31:45.8556304Z         if contiguous:
2025-05-07T20:31:45.8556539Z             x0 = x0.contiguous()
2025-05-07T20:31:45.8556811Z             x1 = x1.contiguous()
2025-05-07T20:31:45.8557063Z 
2025-05-07T20:31:45.8557257Z         if scale_ub is not None:
2025-05-07T20:31:45.8557548Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:45.8557891Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:45.8558205Z             )
2025-05-07T20:31:45.8558403Z         else:
2025-05-07T20:31:45.8558622Z             scale_ub_tensor = None
2025-05-07T20:31:45.8558878Z 
2025-05-07T20:31:45.8559118Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.8559444Z             op = silu_mul_quant
2025-05-07T20:31:45.8559700Z             if compiled:
2025-05-07T20:31:45.8559959Z                 op = torch.compile(op)
2025-05-07T20:31:45.8560273Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.8560566Z 
2025-05-07T20:31:45.8560760Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:45.8560938Z 
2025-05-07T20:31:45.8561043Z moe/activation_test.py:117: 
2025-05-07T20:31:45.8561346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:45.8561681Z moe/activation_test.py:115: in fn
2025-05-07T20:31:45.8561973Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.8562539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:45.8563148Z     return fn(*args, **kwargs)
2025-05-07T20:31:45.8563812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:45.8564499Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:45.8565041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:45.8565819Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.8566486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:45.8567022Z     kernel = self.compile(
2025-05-07T20:31:45.8567568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:45.8568222Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.8568627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:45.8568861Z 
2025-05-07T20:31:45.8569077Z self = 
2025-05-07T20:31:45.8570164Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:45.8571531Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd72a4dc0>}
2025-05-07T20:31:45.8572870Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:45.8573893Z context = 
2025-05-07T20:31:45.8574185Z 
2025-05-07T20:31:45.8574368Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:45.8574892Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:45.8575367Z                            module_map=module_map)
2025-05-07T20:31:45.8575744Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.8576225Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.8576493Z E       ^
2025-05-07T20:31:45.8576962Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.8577413Z 
2025-05-07T20:31:45.8577838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
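Since the failure is architecture-dependent rather than input-dependent, a device-capability guard would skip these cases up front instead of failing once per generated example. A sketch of such a guard (hypothetical helper name, not from this repo; assumes torch, and infers the SM 8.9+ requirement from the error above, since fp8e4nv corresponds to torch.float8_e4m3fn and is only lowered natively on Ada/Hopper-class GPUs):

import unittest

import torch

def _device_supports_fp8e4nv() -> bool:
    # Compute capability (8, 9) = Ada, (9, 0) = Hopper; older parts such as
    # SM 8.0/8.6 only expose the fp8e4b15/fp8e5 variants listed in the error.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Applied to the test above, the whole case would be skipped on this runner:
# @unittest.skipIf(not _device_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...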
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.8577413Z 2025-05-07T20:31:45.8577838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.8578349Z 2025-05-07T20:31:46.1325324Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.1325947Z self=, 2025-05-07T20:31:46.1326595Z T=2048, 2025-05-07T20:31:46.1326871Z D=7168, 2025-05-07T20:31:46.1327135Z scale_ub=1200.0, 2025-05-07T20:31:46.1327364Z contiguous=False, 2025-05-07T20:31:46.1327601Z compiled=False, 2025-05-07T20:31:46.1327819Z ) 2025-05-07T20:31:46.1328142Z self = 2025-05-07T20:31:46.1328680Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.1328975Z 2025-05-07T20:31:46.1329063Z @given( 2025-05-07T20:31:46.1329298Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.1329620Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.1329939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.1330276Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.1330614Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.1330911Z ) 2025-05-07T20:31:46.1331276Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.1331720Z def test_silu_mul_quant( 2025-05-07T20:31:46.1331972Z self, 2025-05-07T20:31:46.1332179Z T: int, 2025-05-07T20:31:46.1332381Z D: int, 2025-05-07T20:31:46.1332611Z scale_ub: Optional[float], 2025-05-07T20:31:46.1332892Z contiguous: bool, 2025-05-07T20:31:46.1333136Z compiled: bool, 2025-05-07T20:31:46.1333770Z ) -> None: 2025-05-07T20:31:46.1333998Z torch.manual_seed(2025) 2025-05-07T20:31:46.1334246Z 2025-05-07T20:31:46.1334529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.1334884Z 2025-05-07T20:31:46.1335078Z x_sign = torch.sign(x) 2025-05-07T20:31:46.1335381Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.1335701Z x = x_sign * x_clamp 2025-05-07T20:31:46.1335954Z x0 = x[:, :D] 2025-05-07T20:31:46.1336179Z x1 = x[:, D:] 2025-05-07T20:31:46.1336403Z 2025-05-07T20:31:46.1336600Z if contiguous: 2025-05-07T20:31:46.1336837Z x0 = x0.contiguous() 2025-05-07T20:31:46.1337110Z x1 = x1.contiguous() 2025-05-07T20:31:46.1337367Z 2025-05-07T20:31:46.1337561Z if scale_ub is not None: 2025-05-07T20:31:46.1337848Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.1338200Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.1338528Z ) 2025-05-07T20:31:46.1338736Z else: 2025-05-07T20:31:46.1338957Z scale_ub_tensor = None 2025-05-07T20:31:46.1339212Z 2025-05-07T20:31:46.1339453Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.1339784Z op = silu_mul_quant 2025-05-07T20:31:46.1340043Z if compiled: 2025-05-07T20:31:46.1340605Z op = torch.compile(op) 2025-05-07T20:31:46.1340914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1341250Z 2025-05-07T20:31:46.1341453Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.1341627Z 2025-05-07T20:31:46.1341735Z moe/activation_test.py:117: 2025-05-07T20:31:46.1342039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1342374Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.1342665Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1343523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.1344229Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.1344784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.1345478Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.1346148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.1346684Z kernel = self.compile( 2025-05-07T20:31:46.1347241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.1347907Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.1348310Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1348556Z 2025-05-07T20:31:46.1348774Z self = 2025-05-07T20:31:46.1349860Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.1351275Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd71b0670>} 2025-05-07T20:31:46.1352622Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.1353637Z context = 2025-05-07T20:31:46.1353937Z 2025-05-07T20:31:46.1354113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.1354769Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.1355243Z module_map=module_map) 2025-05-07T20:31:46.1355616Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.1355977Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.1356250Z E ^ 2025-05-07T20:31:46.1356714Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.1357172Z 2025-05-07T20:31:46.1357589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.1358112Z 2025-05-07T20:31:46.1358218Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.1358644Z self=, 2025-05-07T20:31:46.1359047Z T=1, 2025-05-07T20:31:46.1359248Z D=7168, 2025-05-07T20:31:46.1359459Z scale_ub=None, 2025-05-07T20:31:46.1359682Z contiguous=True, 2025-05-07T20:31:46.1359918Z compiled=False, 2025-05-07T20:31:46.1360132Z ) 2025-05-07T20:31:46.1360456Z self = 2025-05-07T20:31:46.1360954Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.1361226Z 2025-05-07T20:31:46.1361309Z @given( 2025-05-07T20:31:46.1361549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.1361869Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.1362188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.1362536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.1362872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.1363172Z ) 2025-05-07T20:31:46.1363531Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.1364065Z def test_silu_mul_quant( 2025-05-07T20:31:46.1364318Z self, 2025-05-07T20:31:46.1364522Z T: int, 2025-05-07T20:31:46.1364754Z D: int, 2025-05-07T20:31:46.1364977Z scale_ub: Optional[float], 2025-05-07T20:31:46.1365260Z contiguous: bool, 2025-05-07T20:31:46.1365509Z compiled: bool, 2025-05-07T20:31:46.1365731Z ) -> None: 2025-05-07T20:31:46.1365957Z torch.manual_seed(2025) 2025-05-07T20:31:46.1366209Z 2025-05-07T20:31:46.1366487Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.1366832Z 2025-05-07T20:31:46.1367033Z x_sign = torch.sign(x) 2025-05-07T20:31:46.1367334Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.1367647Z x = x_sign * x_clamp 2025-05-07T20:31:46.1367902Z x0 = x[:, :D] 2025-05-07T20:31:46.1368128Z x1 = x[:, D:] 2025-05-07T20:31:46.1368338Z 2025-05-07T20:31:46.1368538Z if contiguous: 2025-05-07T20:31:46.1368791Z x0 = x0.contiguous() 2025-05-07T20:31:46.1369054Z x1 = x1.contiguous() 2025-05-07T20:31:46.1369306Z 2025-05-07T20:31:46.1369508Z if scale_ub is not None: 2025-05-07T20:31:46.1369786Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.1370130Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.1370453Z ) 2025-05-07T20:31:46.1370648Z else: 2025-05-07T20:31:46.1370871Z scale_ub_tensor = None 2025-05-07T20:31:46.1371138Z 2025-05-07T20:31:46.1371376Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.1371698Z op = silu_mul_quant 2025-05-07T20:31:46.1371963Z if compiled: 2025-05-07T20:31:46.1372223Z op = torch.compile(op) 2025-05-07T20:31:46.1372527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1372818Z 2025-05-07T20:31:46.1373059Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.1373330Z 2025-05-07T20:31:46.1373437Z moe/activation_test.py:117: 2025-05-07T20:31:46.1373743Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1374080Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.1374373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1375075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.1375773Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.1376547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.1377389Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.1378071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.1378608Z kernel = self.compile( 2025-05-07T20:31:46.1379175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.1379856Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.1380258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1380501Z 2025-05-07T20:31:46.1380713Z self = 2025-05-07T20:31:46.1381878Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.1383250Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6ee3160>} 2025-05-07T20:31:46.1384692Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.1385731Z context = 2025-05-07T20:31:46.1386029Z 2025-05-07T20:31:46.1386199Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.1386733Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.1387216Z module_map=module_map) 2025-05-07T20:31:46.1387586Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.1387947Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.1388216Z E ^ 2025-05-07T20:31:46.1388680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.1389134Z 2025-05-07T20:31:46.1389558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.1390092Z 2025-05-07T20:31:46.1390198Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.1390620Z self=, 2025-05-07T20:31:46.1391021Z T=16384, 2025-05-07T20:31:46.1391226Z D=7168, 2025-05-07T20:31:46.1391427Z scale_ub=1200.0, 2025-05-07T20:31:46.1391652Z contiguous=False, 2025-05-07T20:31:46.1391888Z compiled=True, 2025-05-07T20:31:46.1392101Z ) 2025-05-07T20:31:46.3299773Z self = 2025-05-07T20:31:46.3300602Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.3301018Z 2025-05-07T20:31:46.3301241Z @given( 2025-05-07T20:31:46.3301575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3301991Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3302328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3303088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3303430Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3303723Z ) 2025-05-07T20:31:46.3304097Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3304552Z def test_silu_mul_quant( 2025-05-07T20:31:46.3304799Z self, 2025-05-07T20:31:46.3305012Z T: int, 2025-05-07T20:31:46.3305218Z D: int, 2025-05-07T20:31:46.3305440Z scale_ub: Optional[float], 2025-05-07T20:31:46.3305726Z contiguous: bool, 2025-05-07T20:31:46.3305975Z compiled: bool, 2025-05-07T20:31:46.3306209Z ) -> None: 2025-05-07T20:31:46.3306439Z torch.manual_seed(2025) 2025-05-07T20:31:46.3306690Z 2025-05-07T20:31:46.3306973Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3307323Z 2025-05-07T20:31:46.3307525Z x_sign = torch.sign(x) 2025-05-07T20:31:46.3307838Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.3308156Z x = x_sign * x_clamp 2025-05-07T20:31:46.3308410Z x0 = x[:, :D] 2025-05-07T20:31:46.3308637Z x1 = x[:, D:] 2025-05-07T20:31:46.3308848Z 2025-05-07T20:31:46.3309047Z if contiguous: 2025-05-07T20:31:46.3309291Z x0 = x0.contiguous() 2025-05-07T20:31:46.3309560Z x1 = x1.contiguous() 2025-05-07T20:31:46.3309832Z 2025-05-07T20:31:46.3310036Z if scale_ub is not None: 2025-05-07T20:31:46.3310315Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.3310668Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.3319117Z ) 2025-05-07T20:31:46.3319345Z else: 2025-05-07T20:31:46.3319572Z scale_ub_tensor = None 2025-05-07T20:31:46.3319840Z 2025-05-07T20:31:46.3320081Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.3320614Z op = silu_mul_quant 2025-05-07T20:31:46.3320894Z if compiled: 2025-05-07T20:31:46.3321151Z op = torch.compile(op) 2025-05-07T20:31:46.3321458Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3321743Z 2025-05-07T20:31:46.3321940Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.3322122Z 2025-05-07T20:31:46.3322228Z moe/activation_test.py:117: 2025-05-07T20:31:46.3322540Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3322910Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.3323230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3323800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.3324369Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.3325030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.3325740Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.3326296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.3326992Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.3327661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.3328198Z kernel = self.compile( 2025-05-07T20:31:46.3328747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.3329409Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.3329816Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3330056Z 2025-05-07T20:31:46.3330264Z self = 2025-05-07T20:31:46.3331352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.3332837Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6ee34c0>} 2025-05-07T20:31:46.3334180Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.3335200Z context = 2025-05-07T20:31:46.3335496Z 2025-05-07T20:31:46.3335667Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.3336196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.3336683Z module_map=module_map) 2025-05-07T20:31:46.3337062Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.3337421Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.3337682Z E ^ 2025-05-07T20:31:46.3338154Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.3338611Z 2025-05-07T20:31:46.3339034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.3339545Z 2025-05-07T20:31:46.3339658Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3340371Z self=, 2025-05-07T20:31:46.3340786Z T=1, 2025-05-07T20:31:46.3341002Z D=7168, 2025-05-07T20:31:46.3341249Z scale_ub=None, 2025-05-07T20:31:46.3341471Z contiguous=False, 2025-05-07T20:31:46.3341712Z compiled=False, 2025-05-07T20:31:46.3342671Z ) 2025-05-07T20:31:46.3342998Z self = 2025-05-07T20:31:46.3343502Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.3343772Z 2025-05-07T20:31:46.3343853Z @given( 2025-05-07T20:31:46.3344094Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3344415Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3344731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3345071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3345404Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3345701Z ) 2025-05-07T20:31:46.3346059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3346506Z def test_silu_mul_quant( 2025-05-07T20:31:46.3346764Z self, 2025-05-07T20:31:46.3346972Z T: int, 2025-05-07T20:31:46.3347186Z D: int, 2025-05-07T20:31:46.3347419Z scale_ub: Optional[float], 2025-05-07T20:31:46.3347710Z contiguous: bool, 2025-05-07T20:31:46.3347960Z compiled: bool, 2025-05-07T20:31:46.3348195Z ) -> None: 2025-05-07T20:31:46.3348425Z torch.manual_seed(2025) 2025-05-07T20:31:46.3348682Z 2025-05-07T20:31:46.3348957Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3349313Z 2025-05-07T20:31:46.3349513Z x_sign = torch.sign(x) 2025-05-07T20:31:46.3349811Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.3350135Z x = x_sign * x_clamp 2025-05-07T20:31:46.3350387Z x0 = x[:, :D] 2025-05-07T20:31:46.3350613Z x1 = x[:, D:] 2025-05-07T20:31:46.3350833Z 2025-05-07T20:31:46.3351033Z if contiguous: 2025-05-07T20:31:46.3351265Z x0 = x0.contiguous() 2025-05-07T20:31:46.3351534Z x1 = x1.contiguous() 2025-05-07T20:31:46.3351911Z 2025-05-07T20:31:46.3352109Z if scale_ub is not None: 2025-05-07T20:31:46.3352398Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.3352746Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.3353101Z ) 2025-05-07T20:31:46.3353315Z else: 2025-05-07T20:31:46.3353536Z scale_ub_tensor = None 2025-05-07T20:31:46.3353795Z 2025-05-07T20:31:46.3354029Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.3354358Z op = silu_mul_quant 2025-05-07T20:31:46.3354620Z if compiled: 2025-05-07T20:31:46.3354869Z op = torch.compile(op) 2025-05-07T20:31:46.3355175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3355461Z 2025-05-07T20:31:46.3355654Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.3355830Z 2025-05-07T20:31:46.3355934Z moe/activation_test.py:117: 2025-05-07T20:31:46.3356241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3356588Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.3356886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3357587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.3358293Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.3358838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.3359532Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.3360211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.3360756Z kernel = self.compile( 2025-05-07T20:31:46.3361301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.3362057Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.3362472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3362708Z 2025-05-07T20:31:46.3362920Z self = 2025-05-07T20:31:46.3364006Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.3365381Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd7023820>} 2025-05-07T20:31:46.3366720Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.3367770Z context = 2025-05-07T20:31:46.3368060Z 2025-05-07T20:31:46.3368227Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.3368766Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.3369237Z module_map=module_map) 2025-05-07T20:31:46.3369608Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.3369967Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.3370233Z E ^ 2025-05-07T20:31:46.3370697Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.3371149Z 2025-05-07T20:31:46.3371571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.3372093Z 2025-05-07T20:31:46.3372204Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3372761Z self=, 2025-05-07T20:31:46.3373164Z T=2048, 2025-05-07T20:31:46.3373359Z D=7168, 2025-05-07T20:31:46.3373556Z scale_ub=None, 2025-05-07T20:31:46.3373773Z contiguous=False, 2025-05-07T20:31:46.3374008Z compiled=True, 2025-05-07T20:31:46.3374218Z ) 2025-05-07T20:31:46.4543570Z self = 2025-05-07T20:31:46.4544355Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.4544757Z 2025-05-07T20:31:46.4544870Z @given( 2025-05-07T20:31:46.4545201Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.4545643Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.4545958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.4546307Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.4546667Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.4546979Z ) 2025-05-07T20:31:46.4547339Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.4547789Z def test_silu_mul_quant( 2025-05-07T20:31:46.4548035Z self, 2025-05-07T20:31:46.4548245Z T: int, 2025-05-07T20:31:46.4548457Z D: int, 2025-05-07T20:31:46.4548681Z scale_ub: Optional[float], 2025-05-07T20:31:46.4548970Z contiguous: bool, 2025-05-07T20:31:46.4549222Z compiled: bool, 2025-05-07T20:31:46.4549456Z ) -> None: 2025-05-07T20:31:46.4549687Z torch.manual_seed(2025) 2025-05-07T20:31:46.4549943Z 2025-05-07T20:31:46.4550221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.4550578Z 2025-05-07T20:31:46.4550783Z x_sign = torch.sign(x) 2025-05-07T20:31:46.4551089Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.4551403Z x = x_sign * x_clamp 2025-05-07T20:31:46.4552055Z x0 = x[:, :D] 2025-05-07T20:31:46.4552288Z x1 = x[:, D:] 2025-05-07T20:31:46.4552501Z 2025-05-07T20:31:46.4552698Z if contiguous: 2025-05-07T20:31:46.4552969Z x0 = x0.contiguous() 2025-05-07T20:31:46.4553264Z x1 = x1.contiguous() 2025-05-07T20:31:46.4553516Z 2025-05-07T20:31:46.4553721Z if scale_ub is not None: 2025-05-07T20:31:46.4554003Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.4554348Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.4554671Z ) 2025-05-07T20:31:46.4554876Z else: 2025-05-07T20:31:46.4555101Z scale_ub_tensor = None 2025-05-07T20:31:46.4555361Z 2025-05-07T20:31:46.4555595Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.4555923Z op = silu_mul_quant 2025-05-07T20:31:46.4556187Z if compiled: 2025-05-07T20:31:46.4556447Z op = torch.compile(op) 2025-05-07T20:31:46.4556766Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.4557061Z 2025-05-07T20:31:46.4557267Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.4557435Z 2025-05-07T20:31:46.4557540Z moe/activation_test.py:117: 2025-05-07T20:31:46.4557851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.4558196Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.4558482Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.4559049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.4559614Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.4560287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.4560982Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.4561535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.4562408Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.4563081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.4563633Z kernel = self.compile( 2025-05-07T20:31:46.4564183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.4564848Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.4565256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.4565500Z 2025-05-07T20:31:46.4565715Z self = 2025-05-07T20:31:46.4566802Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.4568192Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6fe1790>} 2025-05-07T20:31:46.4569533Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.4570555Z context = 2025-05-07T20:31:46.4570856Z 2025-05-07T20:31:46.4571029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.4571564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.4572035Z module_map=module_map) 2025-05-07T20:31:46.4572499Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.4572879Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.4573155Z E ^ 2025-05-07T20:31:46.4573629Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.4574089Z 2025-05-07T20:31:46.4574507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.4575022Z 2025-05-07T20:31:46.4575137Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.4575562Z self=, 2025-05-07T20:31:46.4575967Z T=4096, 2025-05-07T20:31:46.4576168Z D=7168, 2025-05-07T20:31:46.4576368Z scale_ub=None, 2025-05-07T20:31:46.4576589Z contiguous=False, 2025-05-07T20:31:46.4576825Z compiled=True, 2025-05-07T20:31:46.4577040Z ) 2025-05-07T20:31:46.4577363Z self = 2025-05-07T20:31:46.4577880Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.4578155Z 2025-05-07T20:31:46.4578243Z @given( 2025-05-07T20:31:46.4578478Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.4578807Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.4579124Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.4579465Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.4579801Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.4580097Z ) 2025-05-07T20:31:46.4580455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.4580907Z def test_silu_mul_quant( 2025-05-07T20:31:46.4581278Z self, 2025-05-07T20:31:46.4581485Z T: int, 2025-05-07T20:31:46.4581685Z D: int, 2025-05-07T20:31:46.4581914Z scale_ub: Optional[float], 2025-05-07T20:31:46.4582198Z contiguous: bool, 2025-05-07T20:31:46.4582535Z compiled: bool, 2025-05-07T20:31:46.4582773Z ) -> None: 2025-05-07T20:31:46.4582999Z torch.manual_seed(2025) 2025-05-07T20:31:46.4583268Z 2025-05-07T20:31:46.4583578Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.4583948Z 2025-05-07T20:31:46.4584150Z x_sign = torch.sign(x) 2025-05-07T20:31:46.4584445Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.4584768Z x = x_sign * x_clamp 2025-05-07T20:31:46.4585022Z x0 = x[:, :D] 2025-05-07T20:31:46.4585245Z x1 = x[:, D:] 2025-05-07T20:31:46.4585465Z 2025-05-07T20:31:46.4585663Z if contiguous: 2025-05-07T20:31:46.4585900Z x0 = x0.contiguous() 2025-05-07T20:31:46.4586170Z x1 = x1.contiguous() 2025-05-07T20:31:46.4586423Z 2025-05-07T20:31:46.4586619Z if scale_ub is not None: 2025-05-07T20:31:46.4586906Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.4587267Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.4587594Z ) 2025-05-07T20:31:46.4587793Z else: 2025-05-07T20:31:46.4588021Z scale_ub_tensor = None 2025-05-07T20:31:46.4588286Z 2025-05-07T20:31:46.4588520Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.4588846Z op = silu_mul_quant 2025-05-07T20:31:46.4589112Z if compiled: 2025-05-07T20:31:46.4589365Z op = torch.compile(op) 2025-05-07T20:31:46.4589675Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.4589961Z 2025-05-07T20:31:46.4590159Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.4590334Z 2025-05-07T20:31:46.4590438Z moe/activation_test.py:117: 2025-05-07T20:31:46.4590743Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.4591080Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.4591382Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.4592043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.4592621Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.4593341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.4594046Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.4594594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.4595277Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.4595951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.4596498Z kernel = self.compile( 2025-05-07T20:31:46.4597048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.4597719Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.4598125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.4598361Z 2025-05-07T20:31:46.4598578Z self = 2025-05-07T20:31:46.4599664Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.4601027Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6f6a4c0>} 2025-05-07T20:31:46.4602372Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.4603491Z context = 2025-05-07T20:31:46.4603784Z 2025-05-07T20:31:46.4603966Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.4604492Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.4604976Z module_map=module_map) 2025-05-07T20:31:46.4605353Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.4605717Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.4605985Z E ^ 2025-05-07T20:31:46.4606457Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.4606908Z 2025-05-07T20:31:46.4607334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.4607854Z 2025-05-07T20:31:46.6668764Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.6669366Z self=, 2025-05-07T20:31:46.6669959Z T=16384, 2025-05-07T20:31:46.6670273Z D=5120, 2025-05-07T20:31:46.6670523Z scale_ub=1200.0, 2025-05-07T20:31:46.6670757Z contiguous=False, 2025-05-07T20:31:46.6670992Z compiled=False, 2025-05-07T20:31:46.6671213Z ) 2025-05-07T20:31:46.6671541Z self = 2025-05-07T20:31:46.6672048Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.6672339Z 2025-05-07T20:31:46.6672430Z @given( 2025-05-07T20:31:46.6672667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6672996Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.6673313Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.6673662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.6674382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.6674685Z ) 2025-05-07T20:31:46.6675049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.6675501Z def test_silu_mul_quant( 2025-05-07T20:31:46.6675754Z self, 2025-05-07T20:31:46.6675962Z T: int, 2025-05-07T20:31:46.6676165Z D: int, 2025-05-07T20:31:46.6676396Z scale_ub: Optional[float], 2025-05-07T20:31:46.6676686Z contiguous: bool, 2025-05-07T20:31:46.6676931Z compiled: bool, 2025-05-07T20:31:46.6677168Z ) -> None: 2025-05-07T20:31:46.6677397Z torch.manual_seed(2025) 2025-05-07T20:31:46.6677644Z 2025-05-07T20:31:46.6677929Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6678281Z 2025-05-07T20:31:46.6678479Z x_sign = torch.sign(x) 2025-05-07T20:31:46.6678780Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.6679110Z x = x_sign * x_clamp 2025-05-07T20:31:46.6679366Z x0 = x[:, :D] 2025-05-07T20:31:46.6679586Z x1 = x[:, D:] 2025-05-07T20:31:46.6679804Z 2025-05-07T20:31:46.6680002Z if contiguous: 2025-05-07T20:31:46.6680237Z x0 = x0.contiguous() 2025-05-07T20:31:46.6680508Z x1 = x1.contiguous() 2025-05-07T20:31:46.6680760Z 2025-05-07T20:31:46.6680955Z if scale_ub is not None: 2025-05-07T20:31:46.6681241Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.6681588Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.6681902Z ) 2025-05-07T20:31:46.6682108Z else: 2025-05-07T20:31:46.6682332Z scale_ub_tensor = None 2025-05-07T20:31:46.6682586Z 2025-05-07T20:31:46.6682831Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6683200Z op = silu_mul_quant 2025-05-07T20:31:46.6683463Z if compiled: 2025-05-07T20:31:46.6683887Z op = torch.compile(op) 2025-05-07T20:31:46.6684201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6684491Z 2025-05-07T20:31:46.6684687Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.6684864Z 2025-05-07T20:31:46.6684967Z moe/activation_test.py:117: 2025-05-07T20:31:46.6685278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6685614Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.6685908Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6686613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:46.6687307Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.6687860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.6688558Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.6689241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.6689780Z kernel = self.compile( 2025-05-07T20:31:46.6690344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.6691016Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.6691426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6691662Z 2025-05-07T20:31:46.6691876Z self = 2025-05-07T20:31:46.6692956Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.6694479Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6f6a820>} 2025-05-07T20:31:46.6695845Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.6696866Z context = 2025-05-07T20:31:46.6697167Z 2025-05-07T20:31:46.6697339Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.6697874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.6698349Z module_map=module_map) 2025-05-07T20:31:46.6698721Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.6699088Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.6699355Z E ^ 2025-05-07T20:31:46.6699829Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.6700294Z 2025-05-07T20:31:46.6700712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.6701326Z 2025-05-07T20:31:46.6701434Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.6701855Z self=, 2025-05-07T20:31:46.6702258Z T=16384, 2025-05-07T20:31:46.6702461Z D=5120, 2025-05-07T20:31:46.6702663Z scale_ub=1200.0, 2025-05-07T20:31:46.6702891Z contiguous=True, 2025-05-07T20:31:46.6703143Z compiled=True, 2025-05-07T20:31:46.6703358Z ) 2025-05-07T20:31:46.6703716Z self = 2025-05-07T20:31:46.6704242Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.6704609Z 2025-05-07T20:31:46.6704705Z @given( 2025-05-07T20:31:46.6713711Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6714057Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.6714382Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.6714717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.6715058Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.6715363Z ) 2025-05-07T20:31:46.6715722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.6716179Z def test_silu_mul_quant( 2025-05-07T20:31:46.6716435Z self, 2025-05-07T20:31:46.6716636Z T: int, 2025-05-07T20:31:46.6716850Z D: int, 2025-05-07T20:31:46.6717082Z scale_ub: Optional[float], 2025-05-07T20:31:46.6717360Z contiguous: bool, 2025-05-07T20:31:46.6717615Z compiled: bool, 2025-05-07T20:31:46.6717858Z ) -> None: 2025-05-07T20:31:46.6718101Z torch.manual_seed(2025) 2025-05-07T20:31:46.6718361Z 2025-05-07T20:31:46.6718648Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6719007Z 2025-05-07T20:31:46.6719205Z x_sign = torch.sign(x) 2025-05-07T20:31:46.6719514Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.6719839Z x = x_sign * x_clamp 2025-05-07T20:31:46.6720088Z x0 = x[:, :D] 2025-05-07T20:31:46.6720318Z x1 = x[:, D:] 2025-05-07T20:31:46.6720539Z 2025-05-07T20:31:46.6720727Z if contiguous: 2025-05-07T20:31:46.6720975Z x0 = x0.contiguous() 2025-05-07T20:31:46.6721248Z x1 = x1.contiguous() 2025-05-07T20:31:46.6721495Z 2025-05-07T20:31:46.6721698Z if scale_ub is not None: 2025-05-07T20:31:46.6721984Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.6722327Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.6722651Z ) 2025-05-07T20:31:46.6722980Z else: 2025-05-07T20:31:46.6723199Z scale_ub_tensor = None 2025-05-07T20:31:46.6723463Z 2025-05-07T20:31:46.6723705Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6724032Z op = silu_mul_quant 2025-05-07T20:31:46.6724288Z if compiled: 2025-05-07T20:31:46.6724545Z op = torch.compile(op) 2025-05-07T20:31:46.6724858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6725136Z 2025-05-07T20:31:46.6725338Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.6725508Z 2025-05-07T20:31:46.6725621Z moe/activation_test.py:117: 2025-05-07T20:31:46.6725920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6726264Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.6726557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6727120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.6727703Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.6728383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.6729080Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.6729628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.6730320Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.6730995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.6731538Z kernel = self.compile( 2025-05-07T20:31:46.6732085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.6732758Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.6733277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6733516Z 2025-05-07T20:31:46.6733730Z self = 2025-05-07T20:31:46.6734819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.6736203Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6df6e50>} 2025-05-07T20:31:46.6737558Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.6738582Z context = 2025-05-07T20:31:46.6738887Z 2025-05-07T20:31:46.6739062Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.6739609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.6740408Z module_map=module_map) 2025-05-07T20:31:46.6740801Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.6741219Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.6741495Z E ^ 2025-05-07T20:31:46.6741970Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.6742419Z 2025-05-07T20:31:46.6742842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.6743364Z 2025-05-07T20:31:47.1105424Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:47.1106526Z self=, 2025-05-07T20:31:47.1107137Z T=16384, 2025-05-07T20:31:47.1107427Z D=5120, 2025-05-07T20:31:47.1107699Z scale_ub=None, 2025-05-07T20:31:47.1107988Z contiguous=False, 2025-05-07T20:31:47.1108295Z compiled=True, 2025-05-07T20:31:47.1108579Z ) 2025-05-07T20:31:47.1108953Z self = 2025-05-07T20:31:47.1109470Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:47.1109765Z 2025-05-07T20:31:47.1109850Z @given( 2025-05-07T20:31:47.1110103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:47.1110425Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:47.1110748Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:47.1111095Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:47.1111433Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:47.1111752Z ) 2025-05-07T20:31:47.1112131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:47.1112589Z def test_silu_mul_quant( 2025-05-07T20:31:47.1112845Z self, 2025-05-07T20:31:47.1113058Z T: int, 2025-05-07T20:31:47.1113275Z D: int, 2025-05-07T20:31:47.1113506Z scale_ub: Optional[float], 2025-05-07T20:31:47.1113797Z contiguous: bool, 2025-05-07T20:31:47.1114060Z compiled: bool, 2025-05-07T20:31:47.1114296Z ) -> None: 2025-05-07T20:31:47.1114532Z torch.manual_seed(2025) 2025-05-07T20:31:47.1114793Z 2025-05-07T20:31:47.1115083Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:47.1115442Z 2025-05-07T20:31:47.1115648Z x_sign = torch.sign(x) 2025-05-07T20:31:47.1115956Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:47.1116278Z x = x_sign * x_clamp 2025-05-07T20:31:47.1116534Z x0 = x[:, :D] 2025-05-07T20:31:47.1116764Z x1 = x[:, D:] 2025-05-07T20:31:47.1117171Z 2025-05-07T20:31:47.1117375Z if contiguous: 2025-05-07T20:31:47.1117621Z x0 = x0.contiguous() 2025-05-07T20:31:47.1117892Z x1 = x1.contiguous() 2025-05-07T20:31:47.1118150Z 2025-05-07T20:31:47.1118359Z if scale_ub is not None: 2025-05-07T20:31:47.1118648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:47.1118993Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:47.1119317Z ) 2025-05-07T20:31:47.1119527Z else: 2025-05-07T20:31:47.1119748Z scale_ub_tensor = None 2025-05-07T20:31:47.1120013Z 2025-05-07T20:31:47.1120261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:47.1120581Z op = silu_mul_quant 2025-05-07T20:31:47.1120848Z if compiled: 2025-05-07T20:31:47.1121114Z op = torch.compile(op) 2025-05-07T20:31:47.1121418Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.1121717Z 2025-05-07T20:31:47.1121921Z > y_fp8, y_scale = fn() 2025-05-07T20:31:47.1122091Z 2025-05-07T20:31:47.1122199Z moe/activation_test.py:117: 2025-05-07T20:31:47.1122511Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.1122854Z moe/activation_test.py:115: in fn 2025-05-07T20:31:47.1123146Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.1123710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:47.1124284Z return fn(*args, **kwargs) 
2025-05-07T20:31:47.1124953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:47.1125654Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:47.1126208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:47.1127045Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:47.1127734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:47.1128272Z kernel = self.compile( 2025-05-07T20:31:47.1128823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:47.1129491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:47.1129892Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.1130134Z 2025-05-07T20:31:47.1130346Z self = 2025-05-07T20:31:47.1131433Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:47.1132826Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6d799d0>} 2025-05-07T20:31:47.1134168Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:47.1135188Z context = 2025-05-07T20:31:47.1135485Z 2025-05-07T20:31:47.1135656Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:47.1136188Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:47.1136659Z module_map=module_map) 2025-05-07T20:31:47.1137031Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:47.1137396Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:47.1137756Z E ^ 2025-05-07T20:31:47.1138221Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:47.1138677Z 2025-05-07T20:31:47.1139097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:47.1139620Z 2025-05-07T20:31:47.1139727Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:47.1140456Z self=, 2025-05-07T20:31:47.1140862Z T=2048, 2025-05-07T20:31:47.1141058Z D=5120, 2025-05-07T20:31:47.1141322Z scale_ub=None, 2025-05-07T20:31:47.1141542Z contiguous=False, 2025-05-07T20:31:47.1141780Z compiled=True, 2025-05-07T20:31:47.1141993Z ) 2025-05-07T20:31:47.2350881Z self = 2025-05-07T20:31:47.2351678Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:47.2352099Z 2025-05-07T20:31:47.2352191Z @given( 2025-05-07T20:31:47.2352522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:47.2352974Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:47.2353402Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:47.2353748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:47.2354093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:47.2354383Z ) 2025-05-07T20:31:47.2354742Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:47.2355194Z def test_silu_mul_quant( 2025-05-07T20:31:47.2355439Z self, 2025-05-07T20:31:47.2355641Z T: int, 2025-05-07T20:31:47.2355845Z D: int, 2025-05-07T20:31:47.2356071Z scale_ub: Optional[float], 2025-05-07T20:31:47.2356345Z contiguous: bool, 2025-05-07T20:31:47.2356591Z compiled: bool, 2025-05-07T20:31:47.2356831Z ) -> None: 2025-05-07T20:31:47.2357388Z torch.manual_seed(2025) 2025-05-07T20:31:47.2357645Z 2025-05-07T20:31:47.2357925Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:47.2358271Z 2025-05-07T20:31:47.2358472Z x_sign = torch.sign(x) 2025-05-07T20:31:47.2358773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:47.2359091Z x = x_sign * x_clamp 2025-05-07T20:31:47.2359345Z x0 = x[:, :D] 2025-05-07T20:31:47.2359572Z x1 = x[:, D:] 2025-05-07T20:31:47.2359782Z 2025-05-07T20:31:47.2359974Z if contiguous: 2025-05-07T20:31:47.2360214Z x0 = x0.contiguous() 2025-05-07T20:31:47.2360477Z x1 = x1.contiguous() 2025-05-07T20:31:47.2360730Z 2025-05-07T20:31:47.2360930Z if scale_ub is not None: 2025-05-07T20:31:47.2361209Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:47.2361555Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:47.2361885Z ) 2025-05-07T20:31:47.2362085Z else: 2025-05-07T20:31:47.2362297Z scale_ub_tensor = None 2025-05-07T20:31:47.2362559Z 2025-05-07T20:31:47.2362800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:47.2363119Z op = silu_mul_quant 2025-05-07T20:31:47.2363383Z if compiled: 2025-05-07T20:31:47.2363641Z op = torch.compile(op) 2025-05-07T20:31:47.2363943Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.2364232Z 2025-05-07T20:31:47.2364433Z > y_fp8, y_scale = fn() 2025-05-07T20:31:47.2364603Z 2025-05-07T20:31:47.2364710Z moe/activation_test.py:117: 2025-05-07T20:31:47.2365016Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.2365359Z moe/activation_test.py:115: in fn 2025-05-07T20:31:47.2365655Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.2366227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:47.2366980Z return fn(*args, **kwargs) 
2025-05-07T20:31:47.2367651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:47.2368340Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:47.2368894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:47.2369587Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:47.2370258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:47.2370789Z kernel = self.compile( 2025-05-07T20:31:47.2371343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:47.2372011Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:47.2372417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.2372656Z 2025-05-07T20:31:47.2372868Z self = 2025-05-07T20:31:47.2373998Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:47.2375382Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6caa550>} 2025-05-07T20:31:47.2376722Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:47.2377821Z context = 2025-05-07T20:31:47.2378127Z 2025-05-07T20:31:47.2378297Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:47.2378826Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:47.2379302Z module_map=module_map) 2025-05-07T20:31:47.2379670Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:47.2380031Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:47.2380300Z E ^ 2025-05-07T20:31:47.2380764Z E ValueError("type fp8e4nv not supported in this architecture. 
Nine more examples failed identically, with the same test body and traceback as above (the compiled=False runs only lack the torch/_dynamo/eval_frame.py frame), each ending in:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)
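The failure does not depend on the FBGEMM kernel itself. Assuming a Triton build that exposes tl.float8e4nv (the dtype named in the error) and a PyTorch with torch.float8_e4m3fn, a one-line cast reproduces the same compile-time check; a sketch for isolating the problem from the MoE code, not a confirmed repro from this run:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _fp8e4nv_probe(x_ptr, y_ptr):
        # The cast to tl.float8e4nv is what the NVIDIA backend rejects on pre-sm_89 GPUs.
        tl.store(y_ptr, tl.load(x_ptr).to(tl.float8e4nv))

    x = torch.ones(1, device="cuda", dtype=torch.float32)
    y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8e4nv_probe[(1,)](x, y)  # expected: CompilationError on sm_86, compiles on sm_89+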
The run then shifted from compilation failures to CUDA out-of-memory errors:

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = 
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
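The requested sizes match the shape arithmetic exactly: x is a [T, 2*D] bfloat16 tensor, and each elementwise step (torch.abs, torch.clamp, torch.sign, the final multiply) materializes another tensor of the same T * 2D * 2 bytes, so the failed allocations are ordinary intermediates on a card already nearly full after the preceding examples. A quick illustrative check, not part of the test suite:

    def bf16_mib(T: int, D: int) -> float:
        # One [T, 2*D] bfloat16 tensor at 2 bytes per element, in MiB.
        return T * 2 * D * 2 / 2**20

    assert bf16_mib(16384, 5120) == 320.0  # the 320.00 MiB allocation above
    assert bf16_mib(4096, 7168) == 112.0   # the 112.00 MiB allocation below
    assert bf16_mib(16384, 7168) == 448.0  # the 448.00 MiB allocation below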
Three further examples failed the same way:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 32.44 MiB free; 22.03 GiB already in use.
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB with 144.44 MiB free; 21.92 GiB already in use.
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 32.44 MiB free; 22.03 GiB already in use.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5334060Z 2025-05-07T20:31:48.5334185Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:48.5334412Z 2025-05-07T20:31:48.5334522Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5335070Z self=, 2025-05-07T20:31:48.5335484Z T=2048, 2025-05-07T20:31:48.5335676Z D=7168, 2025-05-07T20:31:48.5335879Z scale_ub=None, 2025-05-07T20:31:48.5336110Z contiguous=True, 2025-05-07T20:31:48.5336339Z compiled=False, 2025-05-07T20:31:48.5336560Z ) 2025-05-07T20:31:48.5336886Z self = 2025-05-07T20:31:48.5337389Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.5337671Z 2025-05-07T20:31:48.5337753Z @given( 2025-05-07T20:31:48.5338000Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5338318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5338639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5338981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5339337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5339634Z ) 2025-05-07T20:31:48.5339998Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5340755Z def test_silu_mul_quant( 2025-05-07T20:31:48.5340999Z self, 2025-05-07T20:31:48.5341280Z T: int, 2025-05-07T20:31:48.5341495Z D: int, 2025-05-07T20:31:48.5341718Z scale_ub: Optional[float], 2025-05-07T20:31:48.5341996Z contiguous: bool, 2025-05-07T20:31:48.5342248Z compiled: bool, 2025-05-07T20:31:48.5342472Z ) -> None: 2025-05-07T20:31:48.5342694Z torch.manual_seed(2025) 2025-05-07T20:31:48.5342947Z 2025-05-07T20:31:48.5343221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5343573Z 2025-05-07T20:31:48.5343775Z > x_sign = torch.sign(x) 2025-05-07T20:31:48.5345841Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
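Note that this failure happens on a 56 MiB intermediate (torch.sign) while PyTorch already holds ~21.7 GiB, which suggests tensors from earlier hypothesis examples are still alive. One plausible mitigation sketch (an assumption, not something the test file above is shown to do) is to release the allocator cache between examples:

    import gc

    import torch

    def free_cuda_between_examples() -> None:
        # gc first so dead tensors become collectable, then drop cached
        # allocator blocks so the next example starts from a cleaner state.
        gc.collect()
        torch.cuda.empty_cache()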
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5347714Z 2025-05-07T20:31:48.5347845Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:48.5348068Z 2025-05-07T20:31:48.5348173Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5348595Z self=, 2025-05-07T20:31:48.5349004Z T=1, 2025-05-07T20:31:48.5349192Z D=7168, 2025-05-07T20:31:48.5349394Z scale_ub=1200.0, 2025-05-07T20:31:48.5349632Z contiguous=True, 2025-05-07T20:31:48.5349857Z compiled=False, 2025-05-07T20:31:48.5350082Z ) 2025-05-07T20:31:48.6911980Z self = 2025-05-07T20:31:48.6912699Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.6913079Z 2025-05-07T20:31:48.6913186Z @given( 2025-05-07T20:31:48.6913501Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6913920Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6914253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6914601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6914936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6915233Z ) 2025-05-07T20:31:48.6915598Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6916043Z def test_silu_mul_quant( 2025-05-07T20:31:48.6916295Z self, 2025-05-07T20:31:48.6916505Z T: int, 2025-05-07T20:31:48.6916714Z D: int, 2025-05-07T20:31:48.6917301Z scale_ub: Optional[float], 2025-05-07T20:31:48.6917591Z contiguous: bool, 2025-05-07T20:31:48.6917843Z compiled: bool, 2025-05-07T20:31:48.6918077Z ) -> None: 2025-05-07T20:31:48.6918306Z torch.manual_seed(2025) 2025-05-07T20:31:48.6918561Z 2025-05-07T20:31:48.6918837Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6919191Z 2025-05-07T20:31:48.6919397Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6919693Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6920018Z x = x_sign * x_clamp 2025-05-07T20:31:48.6920270Z x0 = x[:, :D] 2025-05-07T20:31:48.6920490Z x1 = x[:, D:] 2025-05-07T20:31:48.6920708Z 2025-05-07T20:31:48.6920906Z if contiguous: 2025-05-07T20:31:48.6921141Z x0 = x0.contiguous() 2025-05-07T20:31:48.6921409Z x1 = x1.contiguous() 2025-05-07T20:31:48.6921661Z 2025-05-07T20:31:48.6921870Z if scale_ub is not None: 2025-05-07T20:31:48.6922152Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6922496Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6922814Z ) 2025-05-07T20:31:48.6923011Z else: 2025-05-07T20:31:48.6923230Z scale_ub_tensor = None 2025-05-07T20:31:48.6923494Z 2025-05-07T20:31:48.6923729Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6924056Z op = silu_mul_quant 2025-05-07T20:31:48.6924321Z if compiled: 2025-05-07T20:31:48.6924575Z op = torch.compile(op) 2025-05-07T20:31:48.6924888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6925177Z 2025-05-07T20:31:48.6925373Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6925552Z 2025-05-07T20:31:48.6925658Z moe/activation_test.py:117: 2025-05-07T20:31:48.6925970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6926454Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6926755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6927460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6928162Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6928711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6929405Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6930082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6930626Z kernel = self.compile( 2025-05-07T20:31:48.6931182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6931853Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6932257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6932497Z 2025-05-07T20:31:48.6932710Z self = 2025-05-07T20:31:48.6933801Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6935176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6798550>} 2025-05-07T20:31:48.6936517Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6937534Z context = 2025-05-07T20:31:48.6937918Z 2025-05-07T20:31:48.6938092Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6938628Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6939103Z module_map=module_map) 2025-05-07T20:31:48.6939473Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6939836Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6940390Z E ^ 2025-05-07T20:31:48.6940854Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6941361Z 2025-05-07T20:31:48.6941781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6942301Z 2025-05-07T20:31:48.6942406Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6942841Z self=, 2025-05-07T20:31:48.6943243Z T=128, 2025-05-07T20:31:48.6943441Z D=5120, 2025-05-07T20:31:48.6943644Z scale_ub=None, 2025-05-07T20:31:48.6943863Z contiguous=True, 2025-05-07T20:31:48.6944100Z compiled=False, 2025-05-07T20:31:48.6944345Z ) 2025-05-07T20:31:48.6944691Z self = 2025-05-07T20:31:48.6945191Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.6945458Z 2025-05-07T20:31:48.6945544Z @given( 2025-05-07T20:31:48.6945774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6946099Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6946414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6946751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6947080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6947500Z ) 2025-05-07T20:31:48.6947864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6948309Z def test_silu_mul_quant( 2025-05-07T20:31:48.6948559Z self, 2025-05-07T20:31:48.6948762Z T: int, 2025-05-07T20:31:48.6948962Z D: int, 2025-05-07T20:31:48.6949194Z scale_ub: Optional[float], 2025-05-07T20:31:48.6949484Z contiguous: bool, 2025-05-07T20:31:48.6949725Z compiled: bool, 2025-05-07T20:31:48.6949959Z ) -> None: 2025-05-07T20:31:48.6950182Z torch.manual_seed(2025) 2025-05-07T20:31:48.6950424Z 2025-05-07T20:31:48.6950700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6951048Z 2025-05-07T20:31:48.6951242Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6951543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6951859Z x = x_sign * x_clamp 2025-05-07T20:31:48.6952111Z x0 = x[:, :D] 2025-05-07T20:31:48.6952341Z x1 = x[:, D:] 2025-05-07T20:31:48.6952557Z 2025-05-07T20:31:48.6952749Z if contiguous: 2025-05-07T20:31:48.6952981Z x0 = x0.contiguous() 2025-05-07T20:31:48.6953254Z x1 = x1.contiguous() 2025-05-07T20:31:48.6953506Z 2025-05-07T20:31:48.6953701Z if scale_ub is not None: 2025-05-07T20:31:48.6953982Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6954333Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6954645Z ) 2025-05-07T20:31:48.6954858Z else: 2025-05-07T20:31:48.6955082Z scale_ub_tensor = None 2025-05-07T20:31:48.6955342Z 2025-05-07T20:31:48.6955591Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6955916Z op = silu_mul_quant 2025-05-07T20:31:48.6956170Z if compiled: 2025-05-07T20:31:48.6956428Z op = torch.compile(op) 2025-05-07T20:31:48.6956746Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6957182Z 2025-05-07T20:31:48.6957375Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6957556Z 2025-05-07T20:31:48.6957657Z moe/activation_test.py:117: 2025-05-07T20:31:48.6957963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6958297Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6958590Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6959285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6959978Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6960518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6961209Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6961882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6962419Z kernel = self.compile( 2025-05-07T20:31:48.6962963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6963647Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6964074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6964307Z 2025-05-07T20:31:48.6964518Z self = 2025-05-07T20:31:48.6965595Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6967068Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd696b040>} 2025-05-07T20:31:48.6968426Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6969440Z context = 2025-05-07T20:31:48.6969741Z 2025-05-07T20:31:48.6969913Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6970448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6970923Z module_map=module_map) 2025-05-07T20:31:48.6971289Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6971655Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6971929Z E ^ 2025-05-07T20:31:48.6972399Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6972872Z 2025-05-07T20:31:48.6973294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6973810Z 2025-05-07T20:31:48.6973920Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6974344Z self=, 2025-05-07T20:31:48.6974751Z T=128, 2025-05-07T20:31:48.6974947Z D=7168, 2025-05-07T20:31:48.6975148Z scale_ub=None, 2025-05-07T20:31:48.6975364Z contiguous=True, 2025-05-07T20:31:48.6975597Z compiled=False, 2025-05-07T20:31:48.6975815Z ) 2025-05-07T20:31:48.7875553Z self = 2025-05-07T20:31:48.7876099Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.7876371Z 2025-05-07T20:31:48.7876459Z @given( 2025-05-07T20:31:48.7876712Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.7877279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.7877604Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.7877946Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.7878281Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.7878579Z ) 2025-05-07T20:31:48.7878937Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.7879382Z def test_silu_mul_quant( 2025-05-07T20:31:48.7879633Z self, 2025-05-07T20:31:48.7879840Z T: int, 2025-05-07T20:31:48.7880039Z D: int, 2025-05-07T20:31:48.7880268Z scale_ub: Optional[float], 2025-05-07T20:31:48.7880551Z contiguous: bool, 2025-05-07T20:31:48.7880795Z compiled: bool, 2025-05-07T20:31:48.7881034Z ) -> None: 2025-05-07T20:31:48.7881260Z torch.manual_seed(2025) 2025-05-07T20:31:48.7881506Z 2025-05-07T20:31:48.7881988Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.7882353Z 2025-05-07T20:31:48.7882556Z x_sign = torch.sign(x) 2025-05-07T20:31:48.7882853Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.7883174Z x = x_sign * x_clamp 2025-05-07T20:31:48.7883428Z x0 = x[:, :D] 2025-05-07T20:31:48.7883646Z x1 = x[:, D:] 2025-05-07T20:31:48.7883861Z 2025-05-07T20:31:48.7884056Z if contiguous: 2025-05-07T20:31:48.7884293Z x0 = x0.contiguous() 2025-05-07T20:31:48.7884564Z x1 = x1.contiguous() 2025-05-07T20:31:48.7884818Z 2025-05-07T20:31:48.7885014Z if scale_ub is not None: 2025-05-07T20:31:48.7885299Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.7885654Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.7885966Z ) 2025-05-07T20:31:48.7886168Z else: 2025-05-07T20:31:48.7886389Z scale_ub_tensor = None 2025-05-07T20:31:48.7886801Z 2025-05-07T20:31:48.7887044Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.7887369Z op = silu_mul_quant 2025-05-07T20:31:48.7887631Z if compiled: 2025-05-07T20:31:48.7887881Z op = torch.compile(op) 2025-05-07T20:31:48.7888184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.7888468Z 2025-05-07T20:31:48.7888661Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.7888834Z 2025-05-07T20:31:48.7888939Z moe/activation_test.py:117: 2025-05-07T20:31:48.7889241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.7889574Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.7889865Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.7890560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.7891258Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.7891810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.7892497Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.7893166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.7893701Z kernel = self.compile( 2025-05-07T20:31:48.7894299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.7894958Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.7895368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.7895601Z 2025-05-07T20:31:48.7895811Z self = 2025-05-07T20:31:48.7896911Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.7898363Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd696bc10>} 2025-05-07T20:31:48.7899702Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.7900721Z context = 2025-05-07T20:31:48.7901013Z 2025-05-07T20:31:48.7901245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.7901785Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.7902265Z module_map=module_map) 2025-05-07T20:31:48.7902642Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.7903004Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.7903274Z E ^ 2025-05-07T20:31:48.7903738Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.7904200Z 2025-05-07T20:31:48.7904617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.7905140Z 2025-05-07T20:31:48.7905247Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.7905670Z self=, 2025-05-07T20:31:48.7906082Z T=2048, 2025-05-07T20:31:48.7906272Z D=7168, 2025-05-07T20:31:48.7906473Z scale_ub=1200.0, 2025-05-07T20:31:48.7906710Z contiguous=True, 2025-05-07T20:31:48.7906937Z compiled=False, 2025-05-07T20:31:48.7907162Z ) 2025-05-07T20:31:48.7907576Z self = 2025-05-07T20:31:48.7908076Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.7908361Z 2025-05-07T20:31:48.7908445Z @given( 2025-05-07T20:31:48.7908685Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.7909000Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.7909321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.7909662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.7910000Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.7910293Z ) 2025-05-07T20:31:48.7910651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.7911102Z def test_silu_mul_quant( 2025-05-07T20:31:48.7911348Z self, 2025-05-07T20:31:48.7911552Z T: int, 2025-05-07T20:31:48.7911755Z D: int, 2025-05-07T20:31:48.7911990Z scale_ub: Optional[float], 2025-05-07T20:31:48.7912272Z contiguous: bool, 2025-05-07T20:31:48.7912520Z compiled: bool, 2025-05-07T20:31:48.7912744Z ) -> None: 2025-05-07T20:31:48.7912977Z torch.manual_seed(2025) 2025-05-07T20:31:48.7913228Z 2025-05-07T20:31:48.7913501Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.7915596Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
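The CompilationError traces above ("type fp8e4nv not supported in this architecture") come from Triton rejecting the FP8 E4M3 variant during kernel compilation: the 22.07 GiB device in these traces looks like an A10G (sm_86, an assumption from the memory size), and fp8e4nv codegen generally needs a newer NVIDIA architecture. A guard sketch, treating the (8, 9) floor as an assumption rather than something stated in the log:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # A10G reports capability (8, 6); fp8e4nv needs a newer arch.
        return torch.cuda.get_device_capability() >= (8, 9)

    # e.g. at the top of test_silu_mul_quant:
    # if not supports_fp8e4nv():
    #     pytest.skip("fp8e4nv unsupported on this GPU architecture")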
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.7917737Z 2025-05-07T20:31:48.7917865Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.7918090Z 2025-05-07T20:31:48.7918195Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.7918617Z self=, 2025-05-07T20:31:48.7919018Z T=1, 2025-05-07T20:31:48.7919212Z D=5120, 2025-05-07T20:31:48.7919412Z scale_ub=1200.0, 2025-05-07T20:31:48.7919635Z contiguous=True, 2025-05-07T20:31:48.7919866Z compiled=False, 2025-05-07T20:31:48.7920080Z ) 2025-05-07T20:31:48.8411177Z self = 2025-05-07T20:31:48.8411730Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.8411996Z 2025-05-07T20:31:48.8412077Z @given( 2025-05-07T20:31:48.8412317Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.8412638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.8412971Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.8413313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.8413655Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.8413947Z ) 2025-05-07T20:31:48.8414299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.8414747Z def test_silu_mul_quant( 2025-05-07T20:31:48.8414997Z self, 2025-05-07T20:31:48.8415193Z T: int, 2025-05-07T20:31:48.8415398Z D: int, 2025-05-07T20:31:48.8415628Z scale_ub: Optional[float], 2025-05-07T20:31:48.8415902Z contiguous: bool, 2025-05-07T20:31:48.8416149Z compiled: bool, 2025-05-07T20:31:48.8416384Z ) -> None: 2025-05-07T20:31:48.8416604Z torch.manual_seed(2025) 2025-05-07T20:31:48.8416859Z 2025-05-07T20:31:48.8417139Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.8417490Z 2025-05-07T20:31:48.8417917Z x_sign = torch.sign(x) 2025-05-07T20:31:48.8418231Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.8418553Z x = x_sign * x_clamp 2025-05-07T20:31:48.8418801Z x0 = x[:, :D] 2025-05-07T20:31:48.8419024Z x1 = x[:, D:] 2025-05-07T20:31:48.8419235Z 2025-05-07T20:31:48.8419422Z if contiguous: 2025-05-07T20:31:48.8419660Z x0 = x0.contiguous() 2025-05-07T20:31:48.8419928Z x1 = x1.contiguous() 2025-05-07T20:31:48.8420173Z 2025-05-07T20:31:48.8420370Z if scale_ub is not None: 2025-05-07T20:31:48.8420654Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.8420991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.8421413Z ) 2025-05-07T20:31:48.8429619Z else: 2025-05-07T20:31:48.8429892Z scale_ub_tensor = None 2025-05-07T20:31:48.8430156Z 2025-05-07T20:31:48.8430405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.8430745Z op = silu_mul_quant 2025-05-07T20:31:48.8431013Z if compiled: 2025-05-07T20:31:48.8431272Z op = torch.compile(op) 2025-05-07T20:31:48.8431573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.8431861Z 2025-05-07T20:31:48.8432065Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.8432234Z 2025-05-07T20:31:48.8432350Z moe/activation_test.py:117: 2025-05-07T20:31:48.8432654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.8433001Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.8433293Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.8433986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.8434735Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.8435283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.8436189Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.8436865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.8437413Z kernel = self.compile( 2025-05-07T20:31:48.8437970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.8438637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.8439052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.8439295Z 2025-05-07T20:31:48.8439504Z self = 2025-05-07T20:31:48.8440888Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.8442295Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd66ef9d0>} 2025-05-07T20:31:48.8443654Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.8444716Z context = 2025-05-07T20:31:48.8445005Z 2025-05-07T20:31:48.8445187Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.8445727Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.8446191Z module_map=module_map) 2025-05-07T20:31:48.8446569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.8447060Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.8447324Z E ^ 2025-05-07T20:31:48.8447796Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.8448244Z 2025-05-07T20:31:48.8448666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.8449177Z 2025-05-07T20:31:48.8449295Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.8449714Z self=, 2025-05-07T20:31:48.8450124Z T=2048, 2025-05-07T20:31:48.8450317Z D=5120, 2025-05-07T20:31:48.8450507Z scale_ub=None, 2025-05-07T20:31:48.8450734Z contiguous=True, 2025-05-07T20:31:48.8450969Z compiled=False, 2025-05-07T20:31:48.8451178Z ) 2025-05-07T20:31:48.8451503Z self = 2025-05-07T20:31:48.8452018Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.8452289Z 2025-05-07T20:31:48.8452374Z @given( 2025-05-07T20:31:48.8452604Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.8452923Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.8453238Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.8453570Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.8453908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.8454237Z ) 2025-05-07T20:31:48.8454608Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.8455057Z def test_silu_mul_quant( 2025-05-07T20:31:48.8455307Z self, 2025-05-07T20:31:48.8455500Z T: int, 2025-05-07T20:31:48.8455696Z D: int, 2025-05-07T20:31:48.8455913Z scale_ub: Optional[float], 2025-05-07T20:31:48.8456187Z contiguous: bool, 2025-05-07T20:31:48.8456566Z compiled: bool, 2025-05-07T20:31:48.8456798Z ) -> None: 2025-05-07T20:31:48.8457027Z torch.manual_seed(2025) 2025-05-07T20:31:48.8457271Z 2025-05-07T20:31:48.8457556Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.8457912Z 2025-05-07T20:31:48.8458107Z > x_sign = torch.sign(x) 2025-05-07T20:31:48.8460045Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.8461996Z 2025-05-07T20:31:48.8462129Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:48.8462351Z 2025-05-07T20:31:48.8462468Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.8462887Z self=, 2025-05-07T20:31:48.8463290Z T=16384, 2025-05-07T20:31:48.8463491Z D=5120, 2025-05-07T20:31:48.8463689Z scale_ub=None, 2025-05-07T20:31:48.8463902Z contiguous=True, 2025-05-07T20:31:48.8464139Z compiled=False, 2025-05-07T20:31:48.8464376Z ) 2025-05-07T20:31:48.8464718Z self = 2025-05-07T20:31:48.8465228Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.8465504Z 2025-05-07T20:31:48.8465592Z @given( 2025-05-07T20:31:48.8465820Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.8466141Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.8466458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.8466881Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.8467231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.8467530Z ) 2025-05-07T20:31:48.8467886Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.8468333Z def test_silu_mul_quant( 2025-05-07T20:31:48.8468581Z self, 2025-05-07T20:31:48.8468783Z T: int, 2025-05-07T20:31:48.8468984Z D: int, 2025-05-07T20:31:48.8469209Z scale_ub: Optional[float], 2025-05-07T20:31:48.8469489Z contiguous: bool, 2025-05-07T20:31:48.8469727Z compiled: bool, 2025-05-07T20:31:48.8469958Z ) -> None: 2025-05-07T20:31:48.8470179Z torch.manual_seed(2025) 2025-05-07T20:31:48.8470421Z 2025-05-07T20:31:48.8470701Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.8472735Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.8474585Z 2025-05-07T20:31:48.8474711Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.8474928Z 2025-05-07T20:31:48.8475042Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.8475454Z self=, 2025-05-07T20:31:48.8475865Z T=4096, 2025-05-07T20:31:48.8476058Z D=5120, 2025-05-07T20:31:48.8476249Z scale_ub=None, 2025-05-07T20:31:48.8476473Z contiguous=True, 2025-05-07T20:31:48.8476707Z compiled=False, 2025-05-07T20:31:48.8476998Z ) 2025-05-07T20:31:48.9503334Z self = 2025-05-07T20:31:48.9503921Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.9504324Z 2025-05-07T20:31:48.9504501Z @given( 2025-05-07T20:31:48.9504973Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.9505603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.9506233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.9506898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.9507546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.9508128Z ) 2025-05-07T20:31:48.9508829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.9509706Z def test_silu_mul_quant( 2025-05-07T20:31:48.9510197Z self, 2025-05-07T20:31:48.9510597Z T: int, 2025-05-07T20:31:48.9511024Z D: int, 2025-05-07T20:31:48.9511467Z scale_ub: Optional[float], 2025-05-07T20:31:48.9512020Z contiguous: bool, 2025-05-07T20:31:48.9512496Z compiled: bool, 2025-05-07T20:31:48.9512949Z ) -> None: 2025-05-07T20:31:48.9513388Z torch.manual_seed(2025) 2025-05-07T20:31:48.9513746Z 2025-05-07T20:31:48.9514019Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.9516083Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.9518182Z 2025-05-07T20:31:48.9518312Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.9518535Z 2025-05-07T20:31:48.9518648Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.9519061Z self=, 2025-05-07T20:31:48.9519471Z T=2048, 2025-05-07T20:31:48.9519671Z D=5120, 2025-05-07T20:31:48.9519871Z scale_ub=None, 2025-05-07T20:31:48.9520087Z contiguous=False, 2025-05-07T20:31:48.9520323Z compiled=False, 2025-05-07T20:31:48.9520538Z ) 2025-05-07T20:31:48.9520853Z self = 2025-05-07T20:31:48.9521352Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:48.9521626Z 2025-05-07T20:31:48.9521713Z @given( 2025-05-07T20:31:48.9521942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.9522260Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.9522588Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.9522916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.9523257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.9523555Z ) 2025-05-07T20:31:48.9523939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.9524406Z def test_silu_mul_quant( 2025-05-07T20:31:48.9524657Z self, 2025-05-07T20:31:48.9524859Z T: int, 2025-05-07T20:31:48.9525058Z D: int, 2025-05-07T20:31:48.9525288Z scale_ub: Optional[float], 2025-05-07T20:31:48.9525571Z contiguous: bool, 2025-05-07T20:31:48.9525817Z compiled: bool, 2025-05-07T20:31:48.9526048Z ) -> None: 2025-05-07T20:31:48.9526273Z torch.manual_seed(2025) 2025-05-07T20:31:48.9526518Z 2025-05-07T20:31:48.9526794Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.9528817Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.9530822Z 2025-05-07T20:31:48.9530943Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.9531165Z 2025-05-07T20:31:48.9531277Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.9531690Z self=, 2025-05-07T20:31:48.9532097Z T=4096, 2025-05-07T20:31:48.9532291Z D=7168, 2025-05-07T20:31:48.9532483Z scale_ub=None, 2025-05-07T20:31:48.9532714Z contiguous=True, 2025-05-07T20:31:48.9532943Z compiled=True, 2025-05-07T20:31:48.9533147Z ) 2025-05-07T20:31:48.9533475Z self = 2025-05-07T20:31:48.9533971Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.9534244Z 2025-05-07T20:31:48.9534331Z @given( 2025-05-07T20:31:48.9534563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.9534888Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.9535203Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.9535533Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.9535870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.9536167Z ) 2025-05-07T20:31:48.9536518Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.9536971Z def test_silu_mul_quant( 2025-05-07T20:31:48.9537223Z self, 2025-05-07T20:31:48.9537535Z T: int, 2025-05-07T20:31:48.9537746Z D: int, 2025-05-07T20:31:48.9537973Z scale_ub: Optional[float], 2025-05-07T20:31:48.9538246Z contiguous: bool, 2025-05-07T20:31:48.9538495Z compiled: bool, 2025-05-07T20:31:48.9538723Z ) -> None: 2025-05-07T20:31:48.9538945Z torch.manual_seed(2025) 2025-05-07T20:31:48.9539187Z 2025-05-07T20:31:48.9539464Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.9541843Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.9543724Z 2025-05-07T20:31:48.9543852Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.9544067Z 2025-05-07T20:31:48.9544172Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.9544592Z self=, 2025-05-07T20:31:48.9544999Z T=2048, 2025-05-07T20:31:48.9545194Z D=5120, 2025-05-07T20:31:48.9545385Z scale_ub=1200.0, 2025-05-07T20:31:48.9545618Z contiguous=False, 2025-05-07T20:31:48.9545855Z compiled=False, 2025-05-07T20:31:48.9546060Z ) 2025-05-07T20:31:48.9546387Z self = 2025-05-07T20:31:48.9546889Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.9547167Z 2025-05-07T20:31:48.9547249Z @given( 2025-05-07T20:31:48.9547485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.9547941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.9548255Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.9548597Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.9548936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.9549233Z ) 2025-05-07T20:31:48.9549585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.9550032Z def test_silu_mul_quant( 2025-05-07T20:31:48.9550282Z self, 2025-05-07T20:31:48.9550478Z T: int, 2025-05-07T20:31:48.9550680Z D: int, 2025-05-07T20:31:48.9550906Z scale_ub: Optional[float], 2025-05-07T20:31:48.9551181Z contiguous: bool, 2025-05-07T20:31:48.9551426Z compiled: bool, 2025-05-07T20:31:48.9551658Z ) -> None: 2025-05-07T20:31:48.9551879Z torch.manual_seed(2025) 2025-05-07T20:31:48.9552129Z 2025-05-07T20:31:48.9552410Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.9554487Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.9556349Z 2025-05-07T20:31:48.9556479Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.9556694Z 2025-05-07T20:31:48.9556797Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.9557215Z self=, 2025-05-07T20:31:48.9557630Z T=4096, 2025-05-07T20:31:48.9557820Z D=7168, 2025-05-07T20:31:48.9558138Z scale_ub=1200.0, 2025-05-07T20:31:48.9558374Z contiguous=True, 2025-05-07T20:31:48.9558597Z compiled=False, 2025-05-07T20:31:48.9558814Z ) 2025-05-07T20:31:48.9559137Z self = 2025-05-07T20:31:48.9559631Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.9559912Z 2025-05-07T20:31:48.9559990Z @given( 2025-05-07T20:31:48.9560224Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.9560541Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.9560848Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.9561182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.9561520Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.9561804Z ) 2025-05-07T20:31:48.9562158Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.9562615Z def test_silu_mul_quant( 2025-05-07T20:31:48.9562857Z self, 2025-05-07T20:31:48.9563058Z T: int, 2025-05-07T20:31:48.9563262Z D: int, 2025-05-07T20:31:48.9563480Z scale_ub: Optional[float], 2025-05-07T20:31:48.9563759Z contiguous: bool, 2025-05-07T20:31:48.9564007Z compiled: bool, 2025-05-07T20:31:48.9564234Z ) -> None: 2025-05-07T20:31:48.9564450Z torch.manual_seed(2025) 2025-05-07T20:31:48.9564698Z 2025-05-07T20:31:48.9564973Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.9566985Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.9568943Z 2025-05-07T20:31:48.9569063Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.9569285Z 2025-05-07T20:31:48.9569390Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.9569807Z self=, 2025-05-07T20:31:48.9570214Z T=16384, 2025-05-07T20:31:48.9570406Z D=7168, 2025-05-07T20:31:48.9570601Z scale_ub=None, 2025-05-07T20:31:48.9570827Z contiguous=False, 2025-05-07T20:31:48.9571051Z compiled=True, 2025-05-07T20:31:48.9571261Z ) 2025-05-07T20:31:49.0869363Z self = 2025-05-07T20:31:49.0870080Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.0870500Z 2025-05-07T20:31:49.0870616Z @given( 2025-05-07T20:31:49.0870892Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.0871217Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.0871535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.0871868Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.0872207Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.0872502Z ) 2025-05-07T20:31:49.0872853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.0873305Z def test_silu_mul_quant( 2025-05-07T20:31:49.0873562Z self, 2025-05-07T20:31:49.0873776Z T: int, 2025-05-07T20:31:49.0873978Z D: int, 2025-05-07T20:31:49.0874209Z scale_ub: Optional[float], 2025-05-07T20:31:49.0874495Z contiguous: bool, 2025-05-07T20:31:49.0874739Z compiled: bool, 2025-05-07T20:31:49.0874971Z ) -> None: 2025-05-07T20:31:49.0875193Z torch.manual_seed(2025) 2025-05-07T20:31:49.0875794Z 2025-05-07T20:31:49.0876082Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.0878152Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.0880029Z 2025-05-07T20:31:49.0880157Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.0880371Z 2025-05-07T20:31:49.0880484Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.0880897Z self=, 2025-05-07T20:31:49.0881315Z T=4096, 2025-05-07T20:31:49.0881511Z D=7168, 2025-05-07T20:31:49.0881701Z scale_ub=None, 2025-05-07T20:31:49.0881925Z contiguous=True, 2025-05-07T20:31:49.0882156Z compiled=False, 2025-05-07T20:31:49.0882365Z ) 2025-05-07T20:31:49.0882690Z self = 2025-05-07T20:31:49.0883194Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.0883465Z 2025-05-07T20:31:49.0883546Z @given( 2025-05-07T20:31:49.0883792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.0884105Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.0884457Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.0884824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.0885157Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.0885452Z ) 2025-05-07T20:31:49.0885817Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.0886420Z def test_silu_mul_quant( 2025-05-07T20:31:49.0886672Z self, 2025-05-07T20:31:49.0886881Z T: int, 2025-05-07T20:31:49.0887081Z D: int, 2025-05-07T20:31:49.0887314Z scale_ub: Optional[float], 2025-05-07T20:31:49.0887595Z contiguous: bool, 2025-05-07T20:31:49.0887847Z compiled: bool, 2025-05-07T20:31:49.0888073Z ) -> None: 2025-05-07T20:31:49.0888299Z torch.manual_seed(2025) 2025-05-07T20:31:49.0888552Z 2025-05-07T20:31:49.0888823Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.0890847Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.0892692Z 2025-05-07T20:31:49.0892815Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.0893032Z 2025-05-07T20:31:49.0893142Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.0893553Z self=, 2025-05-07T20:31:49.0893961Z T=16384, 2025-05-07T20:31:49.0894164Z D=7168, 2025-05-07T20:31:49.0894361Z scale_ub=None, 2025-05-07T20:31:49.0894576Z contiguous=True, 2025-05-07T20:31:49.0894809Z compiled=False, 2025-05-07T20:31:49.0895019Z ) 2025-05-07T20:31:49.0895338Z self = 2025-05-07T20:31:49.0895840Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.0896204Z 2025-05-07T20:31:49.0896295Z @given( 2025-05-07T20:31:49.0896527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.0896847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.0897163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.0897492Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.0897832Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.0898125Z ) 2025-05-07T20:31:49.0898479Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.0898923Z def test_silu_mul_quant( 2025-05-07T20:31:49.0899174Z self, 2025-05-07T20:31:49.0899378Z T: int, 2025-05-07T20:31:49.0899577Z D: int, 2025-05-07T20:31:49.0899803Z scale_ub: Optional[float], 2025-05-07T20:31:49.0900082Z contiguous: bool, 2025-05-07T20:31:49.0900323Z compiled: bool, 2025-05-07T20:31:49.0900554Z ) -> None: 2025-05-07T20:31:49.0900788Z torch.manual_seed(2025) 2025-05-07T20:31:49.0901035Z 2025-05-07T20:31:49.0901439Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.0903468Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.0905302Z 2025-05-07T20:31:49.0905423Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.0905638Z 2025-05-07T20:31:49.0905751Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.0906295Z self=, 2025-05-07T20:31:49.0906712Z T=16384, 2025-05-07T20:31:49.0906913Z D=7168, 2025-05-07T20:31:49.0907106Z scale_ub=1200.0, 2025-05-07T20:31:49.0907338Z contiguous=True, 2025-05-07T20:31:49.0907568Z compiled=False, 2025-05-07T20:31:49.0907774Z ) 2025-05-07T20:31:49.0908099Z self = 2025-05-07T20:31:49.0908604Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.0908881Z 2025-05-07T20:31:49.0908974Z @given( 2025-05-07T20:31:49.0909204Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.0909527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.0909841Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.0910172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.0910511Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.0910817Z ) 2025-05-07T20:31:49.0911167Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.0911617Z def test_silu_mul_quant( 2025-05-07T20:31:49.0911869Z self, 2025-05-07T20:31:49.0912065Z T: int, 2025-05-07T20:31:49.0912271Z D: int, 2025-05-07T20:31:49.0912497Z scale_ub: Optional[float], 2025-05-07T20:31:49.0912776Z contiguous: bool, 2025-05-07T20:31:49.0913021Z compiled: bool, 2025-05-07T20:31:49.0913253Z ) -> None: 2025-05-07T20:31:49.0921481Z torch.manual_seed(2025) 2025-05-07T20:31:49.0921787Z 2025-05-07T20:31:49.0922082Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.0924238Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.0926161Z 2025-05-07T20:31:49.0926295Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.0926512Z 2025-05-07T20:31:49.0926619Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.0927047Z self=, 2025-05-07T20:31:49.0927458Z T=128, 2025-05-07T20:31:49.0927648Z D=5120, 2025-05-07T20:31:49.0927850Z scale_ub=1200.0, 2025-05-07T20:31:49.0928087Z contiguous=False, 2025-05-07T20:31:49.0928315Z compiled=False, 2025-05-07T20:31:49.0928532Z ) 2025-05-07T20:31:49.4798676Z self = 2025-05-07T20:31:49.4799304Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.4799587Z 2025-05-07T20:31:49.4799691Z @given( 2025-05-07T20:31:49.4799935Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4800272Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4800603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4800948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4801304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4801608Z ) 2025-05-07T20:31:49.4801976Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4802426Z def test_silu_mul_quant( 2025-05-07T20:31:49.4802686Z self, 2025-05-07T20:31:49.4802900Z T: int, 2025-05-07T20:31:49.4803107Z D: int, 2025-05-07T20:31:49.4803342Z scale_ub: Optional[float], 2025-05-07T20:31:49.4803631Z contiguous: bool, 2025-05-07T20:31:49.4804265Z compiled: bool, 2025-05-07T20:31:49.4804539Z ) -> None: 2025-05-07T20:31:49.4804797Z torch.manual_seed(2025) 2025-05-07T20:31:49.4805050Z 2025-05-07T20:31:49.4805340Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4805699Z 2025-05-07T20:31:49.4805902Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4806217Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4806548Z x = x_sign * x_clamp 2025-05-07T20:31:49.4806801Z x0 = x[:, :D] 2025-05-07T20:31:49.4807040Z x1 = x[:, D:] 2025-05-07T20:31:49.4807266Z 2025-05-07T20:31:49.4807470Z if contiguous: 2025-05-07T20:31:49.4807714Z x0 = x0.contiguous() 2025-05-07T20:31:49.4807995Z x1 = x1.contiguous() 2025-05-07T20:31:49.4808259Z 2025-05-07T20:31:49.4808459Z if scale_ub is not None: 2025-05-07T20:31:49.4808754Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4809122Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4809446Z ) 2025-05-07T20:31:49.4809660Z else: 2025-05-07T20:31:49.4809889Z scale_ub_tensor = None 2025-05-07T20:31:49.4810150Z 2025-05-07T20:31:49.4810406Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4810740Z op = silu_mul_quant 2025-05-07T20:31:49.4811001Z if compiled: 2025-05-07T20:31:49.4811269Z op = torch.compile(op) 2025-05-07T20:31:49.4811591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4811874Z 2025-05-07T20:31:49.4812084Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.4812265Z 2025-05-07T20:31:49.4812375Z moe/activation_test.py:117: 2025-05-07T20:31:49.4812690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4813031Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.4813330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4814229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.4814943Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.4815505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4816206Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4816885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4817428Z kernel = self.compile( 2025-05-07T20:31:49.4817984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4818652Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4819060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4819324Z 2025-05-07T20:31:49.4819539Z self = 2025-05-07T20:31:49.4820632Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4822156Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd65c6670>} 2025-05-07T20:31:49.4823519Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4824549Z context = 2025-05-07T20:31:49.4824860Z 2025-05-07T20:31:49.4825066Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4825694Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4826171Z module_map=module_map) 2025-05-07T20:31:49.4826548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4826913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.4827190Z E ^ 2025-05-07T20:31:49.4827655Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4828111Z 2025-05-07T20:31:49.4828532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4829057Z 2025-05-07T20:31:49.4829166Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4829594Z self=, 2025-05-07T20:31:49.4830007Z T=2048, 2025-05-07T20:31:49.4830216Z D=7168, 2025-05-07T20:31:49.4830427Z scale_ub=None, 2025-05-07T20:31:49.4830652Z contiguous=False, 2025-05-07T20:31:49.4830898Z compiled=False, 2025-05-07T20:31:49.4831122Z ) 2025-05-07T20:31:49.4831446Z self = 2025-05-07T20:31:49.4831963Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.4832252Z 2025-05-07T20:31:49.4832334Z @given( 2025-05-07T20:31:49.4832581Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4832901Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4833227Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4833573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4833914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4834214Z ) 2025-05-07T20:31:49.4834643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4835116Z def test_silu_mul_quant( 2025-05-07T20:31:49.4835373Z self, 2025-05-07T20:31:49.4835569Z T: int, 2025-05-07T20:31:49.4835776Z D: int, 2025-05-07T20:31:49.4836005Z scale_ub: Optional[float], 2025-05-07T20:31:49.4836288Z contiguous: bool, 2025-05-07T20:31:49.4836532Z compiled: bool, 2025-05-07T20:31:49.4836764Z ) -> None: 2025-05-07T20:31:49.4836990Z torch.manual_seed(2025) 2025-05-07T20:31:49.4837238Z 2025-05-07T20:31:49.4837519Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4839568Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
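[Annotation: the CompilationError above is a different failure mode from the OOMs. Triton refuses to emit fp8e4nv (E4M3) code for this GPU and lists only ('fp8e4b15', 'fp8e5') as supported, which is consistent with a pre-Ada device: the g5 runner's A10G reports compute capability (8, 6), while fp8e4nv generally requires (8, 9) or newer. Below is a minimal sketch of a capability gate for such tests; this is a hedged suggestion, not the repository's actual skip logic:]

    import unittest
    import torch

    def has_native_fp8e4nv() -> bool:
        # Triton's fp8e4nv (E4M3) generally needs compute capability >= (8, 9),
        # i.e. Ada or Hopper; the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(has_native_fp8e4nv(), "GPU lacks fp8e4nv (E4M3) support")
    class ActivationTests(unittest.TestCase):
        ...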
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.4841754Z 2025-05-07T20:31:49.4841890Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.4842111Z 2025-05-07T20:31:49.4842217Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4842641Z self=, 2025-05-07T20:31:49.4843050Z T=128, 2025-05-07T20:31:49.4843245Z D=7168, 2025-05-07T20:31:49.4843440Z scale_ub=1200.0, 2025-05-07T20:31:49.4843682Z contiguous=True, 2025-05-07T20:31:49.4843930Z compiled=True, 2025-05-07T20:31:49.4844167Z ) 2025-05-07T20:31:49.5299416Z self = 2025-05-07T20:31:49.5299986Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.5300348Z 2025-05-07T20:31:49.5300714Z @given( 2025-05-07T20:31:49.5300959Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5301377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5301690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5302029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5302365Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5302657Z ) 2025-05-07T20:31:49.5303016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5303469Z def test_silu_mul_quant( 2025-05-07T20:31:49.5303717Z self, 2025-05-07T20:31:49.5303920Z T: int, 2025-05-07T20:31:49.5304120Z D: int, 2025-05-07T20:31:49.5304350Z scale_ub: Optional[float], 2025-05-07T20:31:49.5304630Z contiguous: bool, 2025-05-07T20:31:49.5304873Z compiled: bool, 2025-05-07T20:31:49.5305105Z ) -> None: 2025-05-07T20:31:49.5305331Z torch.manual_seed(2025) 2025-05-07T20:31:49.5305587Z 2025-05-07T20:31:49.5305866Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5306220Z 2025-05-07T20:31:49.5306414Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5306716Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5307038Z x = x_sign * x_clamp 2025-05-07T20:31:49.5307290Z x0 = x[:, :D] 2025-05-07T20:31:49.5307506Z x1 = x[:, D:] 2025-05-07T20:31:49.5307722Z 2025-05-07T20:31:49.5307917Z if contiguous: 2025-05-07T20:31:49.5308153Z x0 = x0.contiguous() 2025-05-07T20:31:49.5308422Z x1 = x1.contiguous() 2025-05-07T20:31:49.5308674Z 2025-05-07T20:31:49.5308870Z if scale_ub is not None: 2025-05-07T20:31:49.5309155Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5309499Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5309815Z ) 2025-05-07T20:31:49.5310170Z else: 2025-05-07T20:31:49.5310397Z scale_ub_tensor = None 2025-05-07T20:31:49.5310655Z 2025-05-07T20:31:49.5310894Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5311218Z op = silu_mul_quant 2025-05-07T20:31:49.5311478Z if compiled: 2025-05-07T20:31:49.5311740Z op = torch.compile(op) 2025-05-07T20:31:49.5312049Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5312338Z 2025-05-07T20:31:49.5312533Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5312711Z 2025-05-07T20:31:49.5312819Z moe/activation_test.py:117: 2025-05-07T20:31:49.5313130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5313466Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5313761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5314337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5314906Z return fn(*args, **kwargs) 2025-05-07T20:31:49.5315571Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5316261Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5316810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5317491Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5318157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5318694Z kernel = self.compile( 2025-05-07T20:31:49.5319243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5319898Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5320309Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5320638Z 2025-05-07T20:31:49.5320855Z self = 2025-05-07T20:31:49.5321930Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5323320Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd659b5e0>} 2025-05-07T20:31:49.5324721Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5325742Z context = 2025-05-07T20:31:49.5326042Z 2025-05-07T20:31:49.5326221Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5326748Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5327220Z module_map=module_map) 2025-05-07T20:31:49.5327597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5327950Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5328220Z E ^ 2025-05-07T20:31:49.5328685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5329135Z 2025-05-07T20:31:49.5329558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5330068Z 2025-05-07T20:31:49.5330175Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5330685Z self=, 2025-05-07T20:31:49.5331104Z T=128, 2025-05-07T20:31:49.5331295Z D=7168, 2025-05-07T20:31:49.5331505Z scale_ub=1200.0, 2025-05-07T20:31:49.5331765Z contiguous=True, 2025-05-07T20:31:49.5332000Z compiled=False, 2025-05-07T20:31:49.5332215Z ) 2025-05-07T20:31:49.5332542Z self = 2025-05-07T20:31:49.5333041Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.5333314Z 2025-05-07T20:31:49.5333404Z @given( 2025-05-07T20:31:49.5333635Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5333958Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5334280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5334614Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5334955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5335252Z ) 2025-05-07T20:31:49.5335615Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5336077Z def test_silu_mul_quant( 2025-05-07T20:31:49.5336329Z self, 2025-05-07T20:31:49.5336534Z T: int, 2025-05-07T20:31:49.5336733Z D: int, 2025-05-07T20:31:49.5336958Z scale_ub: Optional[float], 2025-05-07T20:31:49.5337236Z contiguous: bool, 2025-05-07T20:31:49.5337477Z compiled: bool, 2025-05-07T20:31:49.5337708Z ) -> None: 2025-05-07T20:31:49.5337932Z torch.manual_seed(2025) 2025-05-07T20:31:49.5338181Z 2025-05-07T20:31:49.5338462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5338814Z 2025-05-07T20:31:49.5339008Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5339309Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5341780Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
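[Annotation: for orientation, the op under test (silu_mul_quant, lowered to the _fbgemm_silu_mul_quant Triton kernel in the tracebacks) fuses SiLU(x0) * x1 with row-wise fp8 quantization, returning the quantized tensor and one scale per row. A plain-PyTorch sketch of that math follows, assuming a build that provides torch.float8_e4m3fn; the scale convention (y ~= y_fp8.to(torch.float32) * y_scale[:, None]) is inferred from the test's dequantization step, not from the kernel itself:]

    import torch

    FP8_MAX = 448.0  # largest finite float8_e4m3fn value

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1 in fp32, as in the test's reference path.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # One scale per row so each row fits the E4M3 range.
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = row_max.clamp(max=scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(1)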
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5343823Z 2025-05-07T20:31:49.5343948Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.5344169Z 2025-05-07T20:31:49.5344281Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5344694Z self=, 2025-05-07T20:31:49.5345105Z T=128, 2025-05-07T20:31:49.5345300Z D=5120, 2025-05-07T20:31:49.5345495Z scale_ub=1200.0, 2025-05-07T20:31:49.5345726Z contiguous=True, 2025-05-07T20:31:49.5345954Z compiled=True, 2025-05-07T20:31:49.5346160Z ) 2025-05-07T20:31:49.5346495Z self = 2025-05-07T20:31:49.5346995Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.5347266Z 2025-05-07T20:31:49.5347352Z @given( 2025-05-07T20:31:49.5347583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5347908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5348225Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5348557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5348894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5349190Z ) 2025-05-07T20:31:49.5349545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5349994Z def test_silu_mul_quant( 2025-05-07T20:31:49.5350246Z self, 2025-05-07T20:31:49.5350445Z T: int, 2025-05-07T20:31:49.5350650Z D: int, 2025-05-07T20:31:49.5351036Z scale_ub: Optional[float], 2025-05-07T20:31:49.5351316Z contiguous: bool, 2025-05-07T20:31:49.5351565Z compiled: bool, 2025-05-07T20:31:49.5351797Z ) -> None: 2025-05-07T20:31:49.5352026Z torch.manual_seed(2025) 2025-05-07T20:31:49.5352271Z 2025-05-07T20:31:49.5352549Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5352900Z 2025-05-07T20:31:49.5353094Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.5355067Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
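[Annotation: each of these OOM messages suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but the setting is read when CUDA is first initialized; exporting it after the fact does nothing. A minimal sketch (equivalently, export the variable in the job's environment before pytest starts):]

    import os

    # Must be set before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402 -- imported only after the allocator is configured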
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5356901Z 2025-05-07T20:31:49.5357021Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.5357236Z 2025-05-07T20:31:49.5357349Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5357763Z self=, 2025-05-07T20:31:49.5358171Z T=128, 2025-05-07T20:31:49.5358364Z D=7168, 2025-05-07T20:31:49.5358561Z scale_ub=None, 2025-05-07T20:31:49.5358774Z contiguous=True, 2025-05-07T20:31:49.5359003Z compiled=True, 2025-05-07T20:31:49.5359211Z ) 2025-05-07T20:31:49.8471064Z self = 2025-05-07T20:31:49.8471638Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.8471917Z 2025-05-07T20:31:49.8472019Z @given( 2025-05-07T20:31:49.8472275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.8473073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.8473394Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.8473742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.8474125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.8474436Z ) 2025-05-07T20:31:49.8474804Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.8475263Z def test_silu_mul_quant( 2025-05-07T20:31:49.8475524Z self, 2025-05-07T20:31:49.8475725Z T: int, 2025-05-07T20:31:49.8475938Z D: int, 2025-05-07T20:31:49.8476170Z scale_ub: Optional[float], 2025-05-07T20:31:49.8476450Z contiguous: bool, 2025-05-07T20:31:49.8476705Z compiled: bool, 2025-05-07T20:31:49.8476952Z ) -> None: 2025-05-07T20:31:49.8477179Z torch.manual_seed(2025) 2025-05-07T20:31:49.8477437Z 2025-05-07T20:31:49.8477727Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8479777Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
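[Annotation: on the test's contiguous parameter: the slices x0 = x[:, :D] and x1 = x[:, D:] in the listings above are views whose row stride is 2*D, so they are non-contiguous until .contiguous() copies them. A small illustration:]

    import torch

    D = 8
    x = torch.randn(4, 2 * D)               # stand-in for the [T, 2*D] activations
    x0, x1 = x[:, :D], x[:, D:]             # column slices share x's storage
    print(x0.is_contiguous())               # False: row stride is 2*D, not D
    print(x0.contiguous().is_contiguous())  # True, at the cost of a copy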
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.8481663Z 2025-05-07T20:31:49.8481789Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.8482015Z 2025-05-07T20:31:49.8542604Z FAILED 2025-05-07T20:31:49.8543024Z 2025-05-07T20:31:49.8543561Z =================================== FAILURES =================================== 2025-05-07T20:31:49.8544070Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:49.8544847Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:49.8545555Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:31:49.8546116Z | yield 2025-05-07T20:31:49.8546564Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:31:49.8547083Z | self._callTestMethod(testMethod) 2025-05-07T20:31:49.8547656Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:31:49.8548211Z | method() 2025-05-07T20:31:49.8548874Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:49.8549612Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.8550273Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:49.8550927Z | raise the_error_hypothesis_found 2025-05-07T20:31:49.8551432Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:49.8551950Z +-+---------------- 1 ---------------- 2025-05-07T20:31:49.8552257Z | Traceback (most recent call last): 2025-05-07T20:31:49.8552988Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:49.8553780Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8555893Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
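[Annotation: by this point the allocator reports 21.77 GiB allocated with only 8.44 MiB free, so even a 20 MiB request fails; memory accumulated across the earlier Hypothesis examples rather than any single allocation being oversized. When triaging locally, PyTorch's allocator introspection is the quickest check (standard torch.cuda API, shown as a sketch to run inside the failing process):]

    import torch

    print(torch.cuda.memory_allocated() / 2**30)  # GiB currently allocated by tensors
    print(torch.cuda.memory_reserved() / 2**30)   # GiB held by the caching allocator
    print(torch.cuda.memory_summary())            # full per-pool breakdown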
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.8558173Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.8558635Z | self=, 2025-05-07T20:31:49.8559059Z | T=128, 2025-05-07T20:31:49.8559273Z | D=7168, 2025-05-07T20:31:49.8559495Z | scale_ub=1200.0, 2025-05-07T20:31:49.8559772Z | contiguous=True, 2025-05-07T20:31:49.8560033Z | compiled=False, 2025-05-07T20:31:49.8560265Z | ) 2025-05-07T20:31:49.8560458Z | 2025-05-07T20:31:49.8561002Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:31:49.8561636Z +---------------- 2 ---------------- 2025-05-07T20:31:49.8561977Z | Traceback (most recent call last): 2025-05-07T20:31:49.8562714Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:49.8576253Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8578298Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.8580395Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.8580858Z | self=, 2025-05-07T20:31:49.8581381Z | T=128, 2025-05-07T20:31:49.8581599Z | D=7168, 2025-05-07T20:31:49.8581836Z | scale_ub=None, 2025-05-07T20:31:49.8582095Z | contiguous=True, 2025-05-07T20:31:49.8582348Z | compiled=True, 2025-05-07T20:31:49.8582587Z | ) 2025-05-07T20:31:49.8582783Z | 2025-05-07T20:31:49.8583316Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:49.8583954Z +---------------- 3 ---------------- 2025-05-07T20:31:49.8584265Z | Traceback (most recent call last): 2025-05-07T20:31:49.8584988Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:49.8585786Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8588363Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
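[Annotation: Hypothesis prints an exact recipe for replaying each falsifying example. A sketch of where the decorator goes, using the blob from failure 1 above and the strategies copied from the listing; the blob is pinned to Hypothesis 6.131.14 and must match the strategies exactly, the module's max_examples=_MAX_SAMPLES setting is omitted here, and the decorator should be removed once the bug is fixed:]

    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')  # replays the failure-1 example
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # unchanged test body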
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.8590366Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.8590875Z | self=, 2025-05-07T20:31:49.8591294Z | T=128, 2025-05-07T20:31:49.8591504Z | D=5120, 2025-05-07T20:31:49.8591732Z | scale_ub=1200.0, 2025-05-07T20:31:49.8592081Z | contiguous=True, 2025-05-07T20:31:49.8592327Z | compiled=True, 2025-05-07T20:31:49.8592571Z | ) 2025-05-07T20:31:49.8592765Z | 2025-05-07T20:31:49.8593288Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:31:49.8593904Z +---------------- 4 ---------------- 2025-05-07T20:31:49.8594208Z | Traceback (most recent call last): 2025-05-07T20:31:49.8594938Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:31:49.8595654Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.8596324Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:31:49.8597038Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.8597899Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:31:49.8598761Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.8599409Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:31:49.8600159Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.8600920Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:31:49.8601734Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.8602567Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:31:49.8603488Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.8604272Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:31:49.8605020Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.8605711Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:31:49.8606325Z | fn() 2025-05-07T20:31:49.8606910Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:31:49.8607592Z | self.fn.run( 2025-05-07T20:31:49.8608155Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:31:49.8608753Z | kernel = self.compile( 2025-05-07T20:31:49.8609396Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:31:49.8610162Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.8610919Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:49.8611725Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.8612252Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.8612624Z | def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.8612922Z | ^ 2025-05-07T20:31:49.8613403Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.8614051Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.8614501Z | # The test always failed when commented parts were varied together. 2025-05-07T20:31:49.8615123Z | self=, 2025-05-07T20:31:49.8615573Z | T=1, # or any other generated value 2025-05-07T20:31:49.8615906Z | D=5120, # or any other generated value 2025-05-07T20:31:49.8616272Z | scale_ub=None, # or any other generated value 2025-05-07T20:31:49.8616643Z | contiguous=True, # or any other generated value 2025-05-07T20:31:49.8617022Z | compiled=True, # or any other generated value 2025-05-07T20:31:49.8617340Z | ) 2025-05-07T20:31:49.8617522Z | 2025-05-07T20:31:49.8618056Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:49.8618673Z +------------------------------------ 2025-05-07T20:31:49.8619042Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:31:49.8619434Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.8619876Z self=, 2025-05-07T20:31:49.8620283Z T=1, 2025-05-07T20:31:49.8620469Z D=5120, 2025-05-07T20:31:49.8620673Z scale_ub=None, 2025-05-07T20:31:49.8620896Z contiguous=True, 2025-05-07T20:31:49.8621188Z compiled=True, 2025-05-07T20:31:49.8621405Z ) 2025-05-07T20:31:49.8621729Z self = 2025-05-07T20:31:49.8622229Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.8622500Z 2025-05-07T20:31:49.8622580Z @given( 2025-05-07T20:31:49.8622835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.8623159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.8623464Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.8623803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.8624232Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.8624530Z ) 2025-05-07T20:31:49.8624888Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.8625342Z def test_silu_mul_quant( 2025-05-07T20:31:49.8625586Z self, 2025-05-07T20:31:49.8625790Z T: int, 2025-05-07T20:31:49.8626000Z D: int, 2025-05-07T20:31:49.8626221Z scale_ub: Optional[float], 2025-05-07T20:31:49.8626503Z contiguous: bool, 2025-05-07T20:31:49.8626753Z compiled: bool, 2025-05-07T20:31:49.8627021Z ) -> None: 2025-05-07T20:31:49.8627335Z torch.manual_seed(2025) 2025-05-07T20:31:49.8627697Z 2025-05-07T20:31:49.8628090Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8628583Z 2025-05-07T20:31:49.8628872Z x_sign = torch.sign(x) 2025-05-07T20:31:49.8629302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.8629758Z x = x_sign * x_clamp 2025-05-07T20:31:49.8630144Z x0 = x[:, :D] 2025-05-07T20:31:49.8630472Z x1 = x[:, D:] 2025-05-07T20:31:49.8630780Z 2025-05-07T20:31:49.8631064Z if contiguous: 2025-05-07T20:31:49.8631418Z x0 = x0.contiguous() 
2025-05-07T20:31:49.8631803Z x1 = x1.contiguous() 2025-05-07T20:31:49.8632167Z 2025-05-07T20:31:49.8632458Z if scale_ub is not None: 2025-05-07T20:31:49.8632857Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.8633347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.8633774Z ) 2025-05-07T20:31:49.8634038Z else: 2025-05-07T20:31:49.8634339Z scale_ub_tensor = None 2025-05-07T20:31:49.8634705Z 2025-05-07T20:31:49.8635042Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.8635488Z op = silu_mul_quant 2025-05-07T20:31:49.8635862Z if compiled: 2025-05-07T20:31:49.8636242Z op = torch.compile(op) 2025-05-07T20:31:49.8636884Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.8637292Z 2025-05-07T20:31:49.8637578Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.8637998Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.8638432Z 2025-05-07T20:31:49.8638786Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.8639277Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.8639702Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.8640382Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.8640906Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.8641355Z 2025-05-07T20:31:49.8641644Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.8641922Z 2025-05-07T20:31:49.8642075Z moe/activation_test.py:126: 2025-05-07T20:31:49.8642497Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8643011Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.8643486Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.8644607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.8645692Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.8646460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.8647437Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.8648411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.8649423Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.8650722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.8651781Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.8652813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.8653738Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.8654599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.8655326Z fn() 2025-05-07T20:31:49.8656042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.8656870Z self.fn.run( 2025-05-07T20:31:49.8657530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.8658273Z kernel = self.compile( 2025-05-07T20:31:49.8659038Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.8659952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.8660480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8660796Z 2025-05-07T20:31:49.8661065Z self = 2025-05-07T20:31:49.8662668Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.8664639Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317a5ba040>} 2025-05-07T20:31:49.8666557Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.8668145Z context = 2025-05-07T20:31:49.8668534Z 2025-05-07T20:31:49.8668760Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.8669481Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.8670135Z module_map=module_map) 2025-05-07T20:31:49.8670637Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.8671130Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.8671517Z E ^ 2025-05-07T20:31:49.8672158Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.8672770Z 2025-05-07T20:31:49.8673347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.8674050Z 2025-05-07T20:31:49.8674188Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.8674748Z self=, 2025-05-07T20:31:49.8675301Z T=2048, 2025-05-07T20:31:49.8675572Z D=5120, 2025-05-07T20:31:49.8675851Z scale_ub=1200.0, 2025-05-07T20:31:49.8676169Z contiguous=True, 2025-05-07T20:31:49.8676495Z compiled=False, 2025-05-07T20:31:49.8676800Z ) 2025-05-07T20:31:49.8677262Z self = 2025-05-07T20:31:49.8677943Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.8678327Z 2025-05-07T20:31:49.8678443Z @given( 2025-05-07T20:31:49.8678777Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.8679217Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.8679659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.8680241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.8680701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.8681102Z ) 2025-05-07T20:31:49.8681572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.8682137Z def test_silu_mul_quant( 2025-05-07T20:31:49.8682449Z self, 2025-05-07T20:31:49.8682702Z T: int, 2025-05-07T20:31:49.8682956Z D: int, 2025-05-07T20:31:49.8683254Z scale_ub: Optional[float], 2025-05-07T20:31:49.8683602Z contiguous: bool, 2025-05-07T20:31:49.8683911Z compiled: bool, 2025-05-07T20:31:49.8684212Z ) -> None: 2025-05-07T20:31:49.8684534Z torch.manual_seed(2025) 2025-05-07T20:31:49.8684897Z 2025-05-07T20:31:49.8685281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8685783Z 2025-05-07T20:31:49.8686072Z x_sign = torch.sign(x) 2025-05-07T20:31:49.8686499Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.8686948Z x = x_sign * x_clamp 2025-05-07T20:31:49.8687308Z x0 = x[:, :D] 
2025-05-07T20:31:49.8687615Z x1 = x[:, D:] 2025-05-07T20:31:49.8687931Z 2025-05-07T20:31:49.8688202Z if contiguous: 2025-05-07T20:31:49.8688536Z x0 = x0.contiguous() 2025-05-07T20:31:49.8688919Z x1 = x1.contiguous() 2025-05-07T20:31:49.8689275Z 2025-05-07T20:31:49.8689560Z if scale_ub is not None: 2025-05-07T20:31:49.8689956Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.8690443Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.8690895Z ) 2025-05-07T20:31:49.8691184Z else: 2025-05-07T20:31:49.8691498Z scale_ub_tensor = None 2025-05-07T20:31:49.8691869Z 2025-05-07T20:31:49.8692194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.8692652Z op = silu_mul_quant 2025-05-07T20:31:49.8693136Z if compiled: 2025-05-07T20:31:49.8693488Z op = torch.compile(op) 2025-05-07T20:31:49.8693921Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.8694326Z 2025-05-07T20:31:49.8694577Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.8694799Z 2025-05-07T20:31:49.8694928Z moe/activation_test.py:117: 2025-05-07T20:31:49.8695315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8695737Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.8696090Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.8696960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.8697831Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.8698542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.8699459Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.8700384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.8701078Z kernel = self.compile( 2025-05-07T20:31:49.8701844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.8702665Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.8703182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8703478Z 2025-05-07T20:31:49.8703763Z self = 2025-05-07T20:31:49.8705154Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.8706968Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317a60f9d0>} 2025-05-07T20:31:49.8708728Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.8710056Z context = 2025-05-07T20:31:49.8710441Z 2025-05-07T20:31:49.8710682Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.8711428Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.8712099Z module_map=module_map) 2025-05-07T20:31:49.8712598Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.8713077Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.8713449Z E ^ 2025-05-07T20:31:49.8714089Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.8714723Z 2025-05-07T20:31:49.8715266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.8715974Z 2025-05-07T20:31:49.8716125Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.8716695Z self=, 2025-05-07T20:31:49.8717232Z T=2048, 2025-05-07T20:31:49.8717495Z D=5120, 2025-05-07T20:31:49.8717762Z scale_ub=1200.0, 2025-05-07T20:31:49.8718060Z contiguous=True, 2025-05-07T20:31:49.8718370Z compiled=True, 2025-05-07T20:31:49.8718649Z ) 2025-05-07T20:31:49.8719066Z self = 2025-05-07T20:31:49.8719741Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.8720194Z 2025-05-07T20:31:49.8720309Z @given( 2025-05-07T20:31:49.8720613Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.8721054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.8721486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.8721945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.8722403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.8722793Z ) 2025-05-07T20:31:49.8723263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.8723853Z def test_silu_mul_quant( 2025-05-07T20:31:49.8724203Z self, 2025-05-07T20:31:49.8724476Z T: int, 2025-05-07T20:31:49.8724752Z D: int, 2025-05-07T20:31:49.8725066Z scale_ub: Optional[float], 2025-05-07T20:31:49.8725447Z contiguous: bool, 2025-05-07T20:31:49.8725782Z compiled: bool, 2025-05-07T20:31:49.8726141Z ) -> None: 2025-05-07T20:31:49.8726446Z torch.manual_seed(2025) 2025-05-07T20:31:49.8726774Z 2025-05-07T20:31:49.8727162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8727647Z 2025-05-07T20:31:49.8727904Z x_sign = torch.sign(x) 2025-05-07T20:31:49.8728295Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.8728691Z x = x_sign * x_clamp 2025-05-07T20:31:49.8729039Z x0 = x[:, :D] 2025-05-07T20:31:49.8729352Z x1 = x[:, D:] 2025-05-07T20:31:49.8729652Z 2025-05-07T20:31:49.8729926Z if contiguous: 2025-05-07T20:31:49.8730229Z x0 = x0.contiguous() 2025-05-07T20:31:49.8730564Z x1 = x1.contiguous() 2025-05-07T20:31:49.8730915Z 2025-05-07T20:31:49.8731186Z if scale_ub is not None: 2025-05-07T20:31:49.8731574Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.8732048Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.8733099Z ) 2025-05-07T20:31:49.8733376Z else: 2025-05-07T20:31:49.8733669Z scale_ub_tensor = None 2025-05-07T20:31:49.8734017Z 2025-05-07T20:31:49.8734336Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.8734776Z op = silu_mul_quant 2025-05-07T20:31:49.8735139Z if compiled: 2025-05-07T20:31:49.8735474Z op = torch.compile(op) 2025-05-07T20:31:49.8735897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.8736291Z 2025-05-07T20:31:49.8736561Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.8736970Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.8737363Z 2025-05-07T20:31:49.8737672Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.8738129Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.8738528Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.8738954Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.8739444Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.8739868Z 2025-05-07T20:31:49.8740451Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.8740730Z 2025-05-07T20:31:49.8740868Z moe/activation_test.py:126: 2025-05-07T20:31:49.8741345Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8741806Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.8742243Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.8743348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.8744387Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.8745125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.8746050Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.8747178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.8748170Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.8749173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.8750208Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.8751236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.8752111Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.8752971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.8753722Z fn() 2025-05-07T20:31:49.8754491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.8755277Z self.fn.run( 2025-05-07T20:31:49.8755909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.8756614Z kernel = self.compile( 2025-05-07T20:31:49.8757323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.8758202Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.8758751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8759074Z 2025-05-07T20:31:49.8759366Z self = 2025-05-07T20:31:49.8761819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.8784050Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f317a693a60>} 2025-05-07T20:31:49.8785931Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.8787347Z context = 2025-05-07T20:31:49.8787739Z 2025-05-07T20:31:49.8787982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.8788699Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.8789364Z module_map=module_map) 2025-05-07T20:31:49.8789908Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.8790416Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.8790765Z E ^ 2025-05-07T20:31:49.8791378Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.8791987Z 2025-05-07T20:31:49.8792554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.8793247Z 2025-05-07T20:31:49.8793407Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.8793975Z self=, 2025-05-07T20:31:49.8794536Z T=16384, 2025-05-07T20:31:49.8794822Z D=7168, 2025-05-07T20:31:49.8795095Z scale_ub=1200.0, 2025-05-07T20:31:49.8795419Z contiguous=False, 2025-05-07T20:31:49.8795743Z compiled=False, 2025-05-07T20:31:49.8796039Z ) 2025-05-07T20:31:49.8796476Z self = 2025-05-07T20:31:49.8797394Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.8797769Z 2025-05-07T20:31:49.8797889Z @given( 2025-05-07T20:31:49.8798190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.8798607Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.8799028Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.8799467Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.8799924Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.8800322Z ) 2025-05-07T20:31:49.8800819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.8801450Z def test_silu_mul_quant( 2025-05-07T20:31:49.8801805Z self, 2025-05-07T20:31:49.8802086Z T: int, 2025-05-07T20:31:49.8802363Z D: int, 2025-05-07T20:31:49.8802680Z scale_ub: Optional[float], 2025-05-07T20:31:49.8803095Z contiguous: bool, 2025-05-07T20:31:49.8803420Z compiled: bool, 2025-05-07T20:31:49.8803732Z ) -> None: 2025-05-07T20:31:49.8804030Z torch.manual_seed(2025) 2025-05-07T20:31:49.8804323Z 2025-05-07T20:31:49.8804602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.8804955Z 2025-05-07T20:31:49.8805150Z x_sign = torch.sign(x) 2025-05-07T20:31:49.8805445Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.8805765Z x = x_sign * x_clamp 2025-05-07T20:31:49.8806006Z x0 = x[:, :D] 2025-05-07T20:31:49.8806227Z x1 = x[:, D:] 2025-05-07T20:31:49.8806439Z 2025-05-07T20:31:49.8806627Z if contiguous: 2025-05-07T20:31:49.8806864Z x0 = x0.contiguous() 2025-05-07T20:31:49.8807131Z x1 = x1.contiguous() 2025-05-07T20:31:49.8807375Z 2025-05-07T20:31:49.8807562Z if scale_ub is not None: 2025-05-07T20:31:49.8807945Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.8808296Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.8808606Z ) 2025-05-07T20:31:49.8808802Z else: 2025-05-07T20:31:49.8809017Z scale_ub_tensor = None 2025-05-07T20:31:49.8809264Z 2025-05-07T20:31:49.8809502Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.8809819Z op = silu_mul_quant 2025-05-07T20:31:49.8810069Z if compiled: 
2025-05-07T20:31:49.8810324Z op = torch.compile(op) 2025-05-07T20:31:49.8810624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.8810898Z 2025-05-07T20:31:49.8811091Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.8811261Z 2025-05-07T20:31:49.8811370Z moe/activation_test.py:117: 2025-05-07T20:31:49.8811667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8812001Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.8812295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.8813002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.8813689Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.8814229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.8814924Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.8815595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.8816124Z kernel = self.compile( 2025-05-07T20:31:49.8816669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.8817329Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.8817731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.8818057Z 2025-05-07T20:31:49.8818269Z self = 2025-05-07T20:31:49.8819346Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.8820735Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317ad33700>} 2025-05-07T20:31:49.8822231Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.8823255Z context = 2025-05-07T20:31:49.8823559Z 2025-05-07T20:31:49.8823731Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.8824257Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.8824726Z module_map=module_map) 2025-05-07T20:31:49.8825097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.8825448Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.8825712Z E ^ 2025-05-07T20:31:49.8826176Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f317b531ee0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
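Note on the failure above: "fp8e4nv" is Triton's name for the FP8 E4M3 format these kernels emit, and Triton only enables it on GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on a linux.g5.4xlarge.nvidia.gpu runner, whose A10G reports capability 8.6, so only the 'fp8e4b15' and 'fp8e5' encodings are available and every example that reaches an FP8 kernel fails the same way. A minimal guard along these lines (the helper and class names are illustrative, not taken from the test file) would skip the sweep on unsupported devices:

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv needs SM 8.9+; the A10G on this runner is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...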
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()   (same CompilationError in _fbgemm_silu_mul_quant)
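For reference, both failing kernels back the same row-wise FP8 quantization the test exercises: _fbgemm_silu_mul_quant fuses SiLU(x0) * x1 with the quantization, while ref_fn computes the product in fp32 and hands it to triton_quantize_fp8_row. A rough pure-PyTorch sketch of that quantization follows; the exact scale handling (per-row max, optional scale_ub clamp, E4M3 max of 448) is an assumption about the kernels' behavior, not their actual implementation:

    from typing import Optional, Tuple
    import torch

    FP8_E4M3_MAX = 448.0  # assumed max finite magnitude of torch.float8_e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then quantize each row to FP8 E4M3.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX  # per-row dequant multiplier
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantization then matches the test's check: y_fp8.to(torch.float32) * scale[:, None].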
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()   (same CompilationError in _kernel_quantize_fp8_row)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()   (same CompilationError in _fbgemm_silu_mul_quant)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()   (same CompilationError in _fbgemm_silu_mul_quant)

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()   (same CompilationError in _kernel_quantize_fp8_row)
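Since every tried example fails identically at kernel compile time, a standalone script is a quicker repro than replaying the Hypothesis sweep. Something like the following, with the import path read off the traceback above (the exact module spelling is an assumption about the installed fbgemm_gpu wheel), should raise the same CompilationError on this runner:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    torch.manual_seed(2025)
    T, D = 1, 7168  # first failing example from this log
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError (fp8e4nv).
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)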
2025-05-07T20:31:49.9070960Z op = torch.compile(op) 2025-05-07T20:31:49.9071068Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9071147Z 2025-05-07T20:31:49.9071237Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.9071359Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.9071435Z 2025-05-07T20:31:49.9071569Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9071675Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.9071776Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.9071913Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.9072055Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.9072129Z 2025-05-07T20:31:49.9072229Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.9072234Z 2025-05-07T20:31:49.9072336Z moe/activation_test.py:126: 2025-05-07T20:31:49.9072465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9072575Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.9072711Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.9073266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.9073372Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.9073732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9074044Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9074417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.9074670Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.9075074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.9075325Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.9075694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.9075864Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.9076203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.9076292Z fn() 2025-05-07T20:31:49.9076688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.9076772Z self.fn.run( 2025-05-07T20:31:49.9077123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9077217Z kernel = self.compile( 2025-05-07T20:31:49.9077600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9077782Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9077909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9077913Z 2025-05-07T20:31:49.9078120Z self = 2025-05-07T20:31:49.9078992Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:31:49.9079504Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd947f0d0>} 2025-05-07T20:31:49.9080243Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9080436Z context = 2025-05-07T20:31:49.9080441Z 2025-05-07T20:31:49.9080612Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9080875Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9080991Z module_map=module_map) 2025-05-07T20:31:49.9081157Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9081260Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.9081345Z E ^ 2025-05-07T20:31:49.9081696Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9081700Z 2025-05-07T20:31:49.9082113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9082118Z 2025-05-07T20:31:49.9082230Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9082451Z self=, 2025-05-07T20:31:49.9082535Z T=128, 2025-05-07T20:31:49.9082610Z D=5120, 2025-05-07T20:31:49.9082694Z scale_ub=None, 2025-05-07T20:31:49.9082786Z contiguous=True, 2025-05-07T20:31:49.9082869Z compiled=True, 2025-05-07T20:31:49.9082942Z ) 2025-05-07T20:31:49.9083251Z self = 2025-05-07T20:31:49.9083423Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.9083427Z 2025-05-07T20:31:49.9083504Z @given( 2025-05-07T20:31:49.9083627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9083726Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9083849Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9083976Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9084110Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9084207Z ) 2025-05-07T20:31:49.9084460Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9084555Z def test_silu_mul_quant( 2025-05-07T20:31:49.9084634Z self, 2025-05-07T20:31:49.9084711Z T: int, 2025-05-07T20:31:49.9084788Z D: int, 2025-05-07T20:31:49.9084907Z scale_ub: Optional[float], 2025-05-07T20:31:49.9084997Z contiguous: bool, 2025-05-07T20:31:49.9085081Z compiled: bool, 2025-05-07T20:31:49.9085164Z ) -> None: 2025-05-07T20:31:49.9085258Z torch.manual_seed(2025) 2025-05-07T20:31:49.9085334Z 2025-05-07T20:31:49.9085503Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9085578Z 2025-05-07T20:31:49.9085676Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9085803Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9085894Z x = x_sign * x_clamp 2025-05-07T20:31:49.9085977Z x0 = x[:, :D] 2025-05-07T20:31:49.9086055Z x1 = x[:, D:] 2025-05-07T20:31:49.9086129Z 2025-05-07T20:31:49.9086215Z if contiguous: 2025-05-07T20:31:49.9086311Z x0 = x0.contiguous() 2025-05-07T20:31:49.9086401Z x1 = x1.contiguous() 2025-05-07T20:31:49.9086482Z 2025-05-07T20:31:49.9086576Z if scale_ub is not None: 2025-05-07T20:31:49.9086771Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9086911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9086986Z ) 2025-05-07T20:31:49.9087068Z else: 2025-05-07T20:31:49.9087162Z scale_ub_tensor = None 2025-05-07T20:31:49.9087234Z 2025-05-07T20:31:49.9087367Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:31:49.9087457Z op = silu_mul_quant 2025-05-07T20:31:49.9087543Z if compiled: 2025-05-07T20:31:49.9087649Z op = torch.compile(op) 2025-05-07T20:31:49.9087756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9087831Z 2025-05-07T20:31:49.9087925Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.9088045Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.9088122Z 2025-05-07T20:31:49.9088256Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9088368Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.9088471Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.9088594Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.9088734Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.9088811Z 2025-05-07T20:31:49.9088910Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.9088914Z 2025-05-07T20:31:49.9089011Z moe/activation_test.py:126: 2025-05-07T20:31:49.9089151Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9089259Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.9089398Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.9094879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.9095004Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.9095513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9095741Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9096121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.9096376Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.9096777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.9097031Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.9097409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.9097581Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.9097946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.9098026Z fn() 2025-05-07T20:31:49.9098435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.9098518Z self.fn.run( 2025-05-07T20:31:49.9098858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9098959Z kernel = self.compile( 2025-05-07T20:31:49.9099334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9099515Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9099642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9099648Z 2025-05-07T20:31:49.9099853Z self = 2025-05-07T20:31:49.9100713Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f2fd90d25e0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f2fd91a5940>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
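The ValueError itself names the problem: on this GPU, this Triton build only offers the fp8e4b15 and fp8e5 (E5M2) encodings, while both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant cast to fp8e4nv, Triton's NVIDIA E4M3 type. fp8e4nv lowering appears to require compute capability (8, 9) or newer (Ada/Hopper); pre-Ada parts such as the A100 at (8, 0) or the A10G at (8, 6) trip exactly this error. Below is a minimal guard sketch that would let the suite skip cleanly on such machines; the helper name, the (8, 9) threshold, and the placement are ours, not FBGEMM's:

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    """Best-effort probe: can this device compile fp8e4nv (E4M3) Triton kernels?"""
    if not torch.cuda.is_available():
        return False
    # Assumption: E4M3 conversion is lowered only for SM >= 8.9 (Ada/Hopper);
    # an A10G reports (8, 6) and raises the CompilationError seen above.
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement: gate the whole test class rather than each example.
@unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class ActivationTests(unittest.TestCase):
    ...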
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True): same test body, same failure; fn() returns, then ref_fn() reaches triton_quantize_fp8_row -> _kernel_quantize_fp8_row and dies with the identical fp8e4nv CompilationError.

The forward path fails the same way. Representative report:

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [same @given/@settings decorators and test body as above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f2fd82f5ca0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
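Both failing kernels reduce to the same one-line trigger: any @triton.jit kernel that casts to tl.float8e4nv raises this CompilationError at compile time on an affected GPU. A minimal repro sketch (kernel and buffer names are ours; assumes a Triton/PyTorch pairing recent enough to expose tl.float8e4nv and torch.float8_e4m3fn):

import torch
import triton
import triton.language as tl


@triton.jit
def _probe_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # The offending conversion: bf16 -> fp8e4nv (E4M3).
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# On SM < 8.9 this should fail with the same "type fp8e4nv not supported"
# CompilationError; on Ada/Hopper it compiles and runs.
_probe_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)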
Hypothesis keeps sampling; every remaining example in this chunk fails the same way, with

E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

raised while compiling one of the two kernels. Only the sampled parameters and the first failing call differ:

    T      D     scale_ub  contiguous  compiled  first failing call
    1      5120  None      False       True      ref_fn() -> _kernel_quantize_fp8_row
    1      5120  None      True        False     fn() -> _fbgemm_silu_mul_quant
    128    5120  None      False       True      fn() -> _fbgemm_silu_mul_quant
    128    7168  1200.0    False       False     fn() -> _fbgemm_silu_mul_quant
    128    5120  None      False       False     fn() -> _fbgemm_silu_mul_quant
    128    5120  1200.0    True        False     fn() -> _fbgemm_silu_mul_quant
    1      7168  1200.0    True        True      fn() -> _fbgemm_silu_mul_quant
    1      7168  1200.0    False       True      fn() -> _fbgemm_silu_mul_quant

(In the first row fn() itself returned and only the reference path failed; in the rest, fn() is the first call to compile the offending kernel. Each report repeats the test body and one of the two tracebacks shown above verbatim.)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9261810Z 2025-05-07T20:31:49.9262223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9262227Z 2025-05-07T20:31:49.9262327Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9262553Z self=, 2025-05-07T20:31:49.9262628Z T=1, 2025-05-07T20:31:49.9262703Z D=7168, 2025-05-07T20:31:49.9262786Z scale_ub=None, 2025-05-07T20:31:49.9262870Z contiguous=False, 2025-05-07T20:31:49.9262955Z compiled=True, 2025-05-07T20:31:49.9263028Z ) 2025-05-07T20:31:49.9263245Z self = 2025-05-07T20:31:49.9263409Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.9263500Z 2025-05-07T20:31:49.9263576Z @given( 2025-05-07T20:31:49.9263694Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9263797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9263910Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9264025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9264141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9264216Z ) 2025-05-07T20:31:49.9264466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9264558Z def test_silu_mul_quant( 2025-05-07T20:31:49.9264634Z self, 2025-05-07T20:31:49.9264712Z T: int, 2025-05-07T20:31:49.9264789Z D: int, 2025-05-07T20:31:49.9264888Z scale_ub: Optional[float], 2025-05-07T20:31:49.9264977Z contiguous: bool, 2025-05-07T20:31:49.9265063Z compiled: bool, 2025-05-07T20:31:49.9265139Z ) -> None: 2025-05-07T20:31:49.9265248Z torch.manual_seed(2025) 2025-05-07T20:31:49.9265320Z 2025-05-07T20:31:49.9265490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9265564Z 2025-05-07T20:31:49.9265654Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9265782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9265871Z x = x_sign * x_clamp 2025-05-07T20:31:49.9265951Z x0 = x[:, :D] 2025-05-07T20:31:49.9266036Z x1 = x[:, D:] 2025-05-07T20:31:49.9266107Z 2025-05-07T20:31:49.9266191Z if contiguous: 2025-05-07T20:31:49.9266287Z x0 = x0.contiguous() 2025-05-07T20:31:49.9266374Z x1 = x1.contiguous() 2025-05-07T20:31:49.9266448Z 2025-05-07T20:31:49.9266541Z if scale_ub is not None: 2025-05-07T20:31:49.9266645Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9266779Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9266941Z ) 2025-05-07T20:31:49.9267019Z else: 2025-05-07T20:31:49.9267111Z scale_ub_tensor = None 2025-05-07T20:31:49.9267187Z 2025-05-07T20:31:49.9267318Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9267411Z op = silu_mul_quant 2025-05-07T20:31:49.9267496Z if compiled: 2025-05-07T20:31:49.9267594Z op = torch.compile(op) 2025-05-07T20:31:49.9267704Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9267778Z 2025-05-07T20:31:49.9267867Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.9267989Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.9268063Z 2025-05-07T20:31:49.9268196Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9268303Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.9268403Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.9268534Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.9268689Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.9268766Z 2025-05-07T20:31:49.9268868Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:49.9268872Z 2025-05-07T20:31:49.9268969Z moe/activation_test.py:126: 2025-05-07T20:31:49.9269097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9269203Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.9269335Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.9269889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.9269992Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.9270349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9270579Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9271028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.9271283Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.9271682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.9271934Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.9272315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.9272482Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.9272828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.9272907Z fn() 2025-05-07T20:31:49.9273313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.9273401Z self.fn.run( 2025-05-07T20:31:49.9273743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9273837Z kernel = self.compile( 2025-05-07T20:31:49.9274218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9274394Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9274522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9274527Z 2025-05-07T20:31:49.9274742Z self = 2025-05-07T20:31:49.9275587Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9276105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f2fd7721430>} 2025-05-07T20:31:49.9276842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9277035Z context = 2025-05-07T20:31:49.9277040Z 2025-05-07T20:31:49.9277205Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9277464Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9277573Z module_map=module_map) 2025-05-07T20:31:49.9277739Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9277848Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.9277927Z E ^ 2025-05-07T20:31:49.9278276Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9278280Z 2025-05-07T20:31:49.9278693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9278697Z 2025-05-07T20:31:49.9278798Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9279019Z self=, 2025-05-07T20:31:49.9279098Z T=1, 2025-05-07T20:31:49.9279173Z D=5120, 2025-05-07T20:31:49.9279254Z scale_ub=1200.0, 2025-05-07T20:31:49.9279344Z contiguous=False, 2025-05-07T20:31:49.9279425Z compiled=True, 2025-05-07T20:31:49.9279498Z ) 2025-05-07T20:31:49.9279718Z self = 2025-05-07T20:31:49.9279972Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.9279976Z 2025-05-07T20:31:49.9280054Z @given( 2025-05-07T20:31:49.9280171Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9280268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9280384Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9280499Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9280611Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9280693Z ) 2025-05-07T20:31:49.9280938Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9281034Z def test_silu_mul_quant( 2025-05-07T20:31:49.9281109Z self, 2025-05-07T20:31:49.9281184Z T: int, 2025-05-07T20:31:49.9281263Z D: int, 2025-05-07T20:31:49.9281360Z scale_ub: Optional[float], 2025-05-07T20:31:49.9281451Z contiguous: bool, 2025-05-07T20:31:49.9281548Z compiled: bool, 2025-05-07T20:31:49.9281625Z ) -> None: 2025-05-07T20:31:49.9281718Z torch.manual_seed(2025) 2025-05-07T20:31:49.9281793Z 2025-05-07T20:31:49.9281960Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9282032Z 2025-05-07T20:31:49.9282125Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9282249Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9282340Z x = x_sign * x_clamp 2025-05-07T20:31:49.9282420Z x0 = x[:, :D] 2025-05-07T20:31:49.9282501Z x1 = x[:, D:] 2025-05-07T20:31:49.9282575Z 2025-05-07T20:31:49.9282657Z if contiguous: 2025-05-07T20:31:49.9282749Z x0 = x0.contiguous() 2025-05-07T20:31:49.9282843Z x1 = x1.contiguous() 2025-05-07T20:31:49.9282916Z 2025-05-07T20:31:49.9283007Z if scale_ub is not None: 2025-05-07T20:31:49.9283116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9283364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9283443Z ) 2025-05-07T20:31:49.9283523Z else: 2025-05-07T20:31:49.9283614Z scale_ub_tensor = None 2025-05-07T20:31:49.9283685Z 2025-05-07T20:31:49.9283819Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9283908Z op = silu_mul_quant 2025-05-07T20:31:49.9283998Z if compiled: 
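Every retry above and below fails at the same point: Triton cannot lower the fp8e4nv dtype (PyTorch's torch.float8_e4m3fn) on this GPU. Triton only permits fp8e4nv casts on NVIDIA devices of compute capability sm_89 (Ada) or newer; on older cards only the fp8e4b15 and fp8e5 encodings are available, which is exactly what the ValueError reports. A capability gate along the following lines would turn the architecture mismatch into a skip instead of a failure; this is a minimal sketch, and the helper name, class name, and skip message are illustrative rather than taken from the test file:

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) lowering in Triton needs an NVIDIA GPU
    # with compute capability >= (8, 9), i.e. Ada (sm_89) or Hopper (sm_90).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


class ActivationTests(unittest.TestCase):  # class name assumed for illustration
    @unittest.skipUnless(
        cuda_supports_fp8e4nv(),
        "Triton fp8e4nv needs sm_89+; this GPU only supports fp8e4b15/fp8e5",
    )
    def test_silu_mul_quant(self) -> None:
        ...  # body as in the listing above

Hypothesis's retries cannot help here: the failure is a property of the device, not of the drawn example, so every example fails identically regardless of T, D, scale_ub, contiguous, or compiled.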
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[... test source as above; fails at fn() with the same _fbgemm_silu_mul_quant CompilationError ...]

Hypothesis keeps drawing new examples, and each one fails the same way: the Triton compile of _fbgemm_silu_mul_quant (reached from fn() at moe/activation_test.py:117) raises CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The examples tried:

Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
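For reference, the error reproduces outside the test suite with any Triton kernel that converts to fp8e4nv on such a device. A minimal standalone sketch, assuming Triton's tl.float8e4nv and PyTorch's torch.float8_e4m3fn are available in this environment; the kernel and variable names are invented for illustration:

import torch
import triton
import triton.language as tl


@triton.jit
def to_fp8e4nv_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # On a pre-sm_89 GPU this cast is what raises
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


n = 1024
x = torch.randn(n, device="cuda", dtype=torch.float32)
y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
# Raises triton.compiler.errors.CompilationError on sm_80/sm_86;
# compiles and runs on sm_89/sm_90.
to_fp8e4nv_kernel[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)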
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9394936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9395169Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9395512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9395611Z kernel = self.compile( 2025-05-07T20:31:49.9395994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9396171Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9396294Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9396298Z 2025-05-07T20:31:49.9396505Z self = 2025-05-07T20:31:49.9397277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9397861Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd72a40d0>} 2025-05-07T20:31:49.9398603Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9398795Z context = 2025-05-07T20:31:49.9398800Z 2025-05-07T20:31:49.9398969Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9399234Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9399340Z module_map=module_map) 2025-05-07T20:31:49.9399507Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9399615Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9399692Z E ^ 2025-05-07T20:31:49.9400051Z E ValueError("type fp8e4nv not supported in this architecture. 
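Every example dies in the same place: Triton's frontend rejects the fp8e4nv (FP8 E4M3) element type while lowering _fbgemm_silu_mul_quant. fp8e4nv is only available on GPUs with compute capability 8.9 or newer; on older architectures this Triton build exposes only the 'fp8e4b15' and 'fp8e5' encodings, exactly as the ValueError lists. A conventional way to keep such a suite green on older runners is to gate the test on device capability; the sketch below is illustrative (the helper and test class names are not from the FBGEMM test file):

    import unittest

    import torch


    def supports_fp8_e4m3() -> bool:
        # fp8e4nv (FP8 E4M3) needs SM 8.9+ (Ada/Hopper-class GPUs); earlier
        # parts only get Triton's fp8e4b15/fp8e5, matching the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    class SiluMulQuantGuardExample(unittest.TestCase):
        @unittest.skipIf(not supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
        def test_silu_mul_quant(self) -> None:
            ...  # the real property-based body would run here

With this guard the run would report the test as skipped on a pre-SM-8.9 GPU instead of failing once per Hypothesis example.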
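For orientation, silu_mul_quant(x0, x1, scale_ub) returns a (y_fp8, y_scale) pair, i.e. SiLU(x0) * x1 quantized to FP8. A rough eager-mode equivalent is sketched below, assuming rowwise dynamic scaling with an optional upper bound on the scale numerator; the actual FBGEMM kernel's scaling granularity may differ:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in float32, then quantize to FP8 E4M3.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Rowwise scaling is an assumption of this sketch, not a spec.
        amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)
        y_scale = amax / FP8_MAX  # dequantization scale, one per row
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale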
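Since every drawn example fails identically, reproducing a single case outside Hypothesis is usually quicker when iterating on a fix. A minimal repro using the same seed and tensor construction as the test (import path taken from the traceback above):

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    torch.manual_seed(2025)
    T, D = 4096, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    scale_ub_tensor = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    # Raises CompilationError on pre-SM-8.9 GPUs, as in the log above.
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], scale_ub_tensor)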
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9400056Z 2025-05-07T20:31:49.9400466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9400470Z 2025-05-07T20:31:49.9400574Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9400795Z self=, 2025-05-07T20:31:49.9400871Z T=4096, 2025-05-07T20:31:49.9400950Z D=5120, 2025-05-07T20:31:49.9401032Z scale_ub=1200.0, 2025-05-07T20:31:49.9401118Z contiguous=False, 2025-05-07T20:31:49.9401203Z compiled=True, 2025-05-07T20:31:49.9401274Z ) 2025-05-07T20:31:49.9401491Z self = 2025-05-07T20:31:49.9401744Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.9401749Z 2025-05-07T20:31:49.9401826Z @given( 2025-05-07T20:31:49.9401947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9402045Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9402161Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9402279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9402391Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9402464Z ) 2025-05-07T20:31:49.9402711Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9402803Z def test_silu_mul_quant( 2025-05-07T20:31:49.9402878Z self, 2025-05-07T20:31:49.9402956Z T: int, 2025-05-07T20:31:49.9403031Z D: int, 2025-05-07T20:31:49.9403130Z scale_ub: Optional[float], 2025-05-07T20:31:49.9403222Z contiguous: bool, 2025-05-07T20:31:49.9403318Z compiled: bool, 2025-05-07T20:31:49.9403398Z ) -> None: 2025-05-07T20:31:49.9403492Z torch.manual_seed(2025) 2025-05-07T20:31:49.9403563Z 2025-05-07T20:31:49.9403735Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9403810Z 2025-05-07T20:31:49.9403900Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9404026Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9404114Z x = x_sign * x_clamp 2025-05-07T20:31:49.9404194Z x0 = x[:, :D] 2025-05-07T20:31:49.9404277Z x1 = x[:, D:] 2025-05-07T20:31:49.9404348Z 2025-05-07T20:31:49.9404430Z if contiguous: 2025-05-07T20:31:49.9404523Z x0 = x0.contiguous() 2025-05-07T20:31:49.9404611Z x1 = x1.contiguous() 2025-05-07T20:31:49.9404687Z 2025-05-07T20:31:49.9404777Z if scale_ub is not None: 2025-05-07T20:31:49.9404881Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9405022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9405177Z ) 2025-05-07T20:31:49.9405254Z else: 2025-05-07T20:31:49.9405350Z scale_ub_tensor = None 2025-05-07T20:31:49.9405423Z 2025-05-07T20:31:49.9405550Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9405641Z op = silu_mul_quant 2025-05-07T20:31:49.9405726Z if compiled: 2025-05-07T20:31:49.9405825Z op = torch.compile(op) 2025-05-07T20:31:49.9405932Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9406004Z 2025-05-07T20:31:49.9406096Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9406100Z 2025-05-07T20:31:49.9406196Z moe/activation_test.py:117: 2025-05-07T20:31:49.9406322Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9406423Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9406522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9406901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9407002Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.9407491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9407589Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9407946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9408169Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9408514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9408606Z kernel = self.compile( 2025-05-07T20:31:49.9408991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9409277Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9409407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9409411Z 2025-05-07T20:31:49.9409618Z self = 2025-05-07T20:31:49.9410384Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9410883Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd72a4dc0>} 2025-05-07T20:31:49.9411622Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9411822Z context = 2025-05-07T20:31:49.9411827Z 2025-05-07T20:31:49.9411993Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9412261Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9412371Z module_map=module_map) 2025-05-07T20:31:49.9412533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9412630Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9412710Z E ^ 2025-05-07T20:31:49.9413065Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9413070Z 2025-05-07T20:31:49.9413478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9413482Z 2025-05-07T20:31:49.9413665Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9413886Z self=, 2025-05-07T20:31:49.9413965Z T=2048, 2025-05-07T20:31:49.9414043Z D=7168, 2025-05-07T20:31:49.9414137Z scale_ub=1200.0, 2025-05-07T20:31:49.9414240Z contiguous=False, 2025-05-07T20:31:49.9414339Z compiled=False, 2025-05-07T20:31:49.9414422Z ) 2025-05-07T20:31:49.9414644Z self = 2025-05-07T20:31:49.9414818Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.9414822Z 2025-05-07T20:31:49.9414898Z @given( 2025-05-07T20:31:49.9415017Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9415115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9415230Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9415349Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9415474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9415550Z ) 2025-05-07T20:31:49.9415794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9415886Z def test_silu_mul_quant( 2025-05-07T20:31:49.9415964Z self, 2025-05-07T20:31:49.9416041Z T: int, 2025-05-07T20:31:49.9416116Z D: int, 2025-05-07T20:31:49.9416215Z scale_ub: Optional[float], 2025-05-07T20:31:49.9416303Z contiguous: bool, 2025-05-07T20:31:49.9416387Z compiled: bool, 2025-05-07T20:31:49.9416466Z ) -> None: 2025-05-07T20:31:49.9416564Z torch.manual_seed(2025) 2025-05-07T20:31:49.9416640Z 2025-05-07T20:31:49.9416806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9416879Z 2025-05-07T20:31:49.9416972Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9417095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9417264Z x = x_sign * x_clamp 2025-05-07T20:31:49.9417351Z x0 = x[:, :D] 2025-05-07T20:31:49.9417433Z x1 = x[:, D:] 2025-05-07T20:31:49.9417504Z 2025-05-07T20:31:49.9417591Z if contiguous: 2025-05-07T20:31:49.9417682Z x0 = x0.contiguous() 2025-05-07T20:31:49.9417770Z x1 = x1.contiguous() 2025-05-07T20:31:49.9417847Z 2025-05-07T20:31:49.9417938Z if scale_ub is not None: 2025-05-07T20:31:49.9418048Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9418185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9418260Z ) 2025-05-07T20:31:49.9418344Z else: 2025-05-07T20:31:49.9418439Z scale_ub_tensor = None 2025-05-07T20:31:49.9418510Z 2025-05-07T20:31:49.9418643Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9418732Z op = silu_mul_quant 2025-05-07T20:31:49.9418817Z if compiled: 2025-05-07T20:31:49.9418926Z op = torch.compile(op) 2025-05-07T20:31:49.9419036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9419107Z 2025-05-07T20:31:49.9419201Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9419205Z 2025-05-07T20:31:49.9419300Z moe/activation_test.py:117: 2025-05-07T20:31:49.9419432Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9419531Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9419629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9420122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9420217Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9420578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9420804Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9421295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9421393Z kernel = self.compile( 2025-05-07T20:31:49.9421777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9421953Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9422087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9422091Z 2025-05-07T20:31:49.9422297Z self = 2025-05-07T20:31:49.9423065Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9423568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd71b0670>} 2025-05-07T20:31:49.9424361Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9424554Z context = 2025-05-07T20:31:49.9424559Z 2025-05-07T20:31:49.9424726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9424997Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9425103Z module_map=module_map) 2025-05-07T20:31:49.9425263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9425363Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9425438Z E ^ 2025-05-07T20:31:49.9425873Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9425887Z 2025-05-07T20:31:49.9426305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9426309Z 2025-05-07T20:31:49.9426410Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9426635Z self=, 2025-05-07T20:31:49.9426710Z T=1, 2025-05-07T20:31:49.9426789Z D=7168, 2025-05-07T20:31:49.9426873Z scale_ub=None, 2025-05-07T20:31:49.9426957Z contiguous=True, 2025-05-07T20:31:49.9427039Z compiled=False, 2025-05-07T20:31:49.9427115Z ) 2025-05-07T20:31:49.9427330Z self = 2025-05-07T20:31:49.9427497Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9427502Z 2025-05-07T20:31:49.9427578Z @given( 2025-05-07T20:31:49.9427705Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9427810Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9427924Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9428040Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9428158Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9428231Z ) 2025-05-07T20:31:49.9428480Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9428573Z def test_silu_mul_quant( 2025-05-07T20:31:49.9428648Z self, 2025-05-07T20:31:49.9428725Z T: int, 2025-05-07T20:31:49.9428800Z D: int, 2025-05-07T20:31:49.9428897Z scale_ub: Optional[float], 2025-05-07T20:31:49.9428989Z contiguous: bool, 2025-05-07T20:31:49.9429074Z compiled: bool, 2025-05-07T20:31:49.9429153Z ) -> None: 2025-05-07T20:31:49.9429252Z torch.manual_seed(2025) 2025-05-07T20:31:49.9429409Z 2025-05-07T20:31:49.9429577Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9429654Z 2025-05-07T20:31:49.9429745Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9429873Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9429961Z x = x_sign * x_clamp 2025-05-07T20:31:49.9430041Z x0 = x[:, :D] 2025-05-07T20:31:49.9430122Z x1 = x[:, D:] 2025-05-07T20:31:49.9430193Z 2025-05-07T20:31:49.9430275Z if contiguous: 2025-05-07T20:31:49.9430370Z x0 = x0.contiguous() 2025-05-07T20:31:49.9430459Z x1 = x1.contiguous() 2025-05-07T20:31:49.9430532Z 2025-05-07T20:31:49.9430626Z if scale_ub is not None: 2025-05-07T20:31:49.9430729Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9430865Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9430942Z ) 2025-05-07T20:31:49.9431018Z else: 2025-05-07T20:31:49.9431120Z scale_ub_tensor = None 2025-05-07T20:31:49.9431194Z 2025-05-07T20:31:49.9431326Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9431417Z op = silu_mul_quant 2025-05-07T20:31:49.9431501Z if compiled: 2025-05-07T20:31:49.9431600Z op = torch.compile(op) 2025-05-07T20:31:49.9431709Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9431780Z 2025-05-07T20:31:49.9431870Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9431874Z 2025-05-07T20:31:49.9431973Z moe/activation_test.py:117: 2025-05-07T20:31:49.9432101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9432201Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9432302Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9432796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9432974Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9433339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9433565Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9433912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9434005Z kernel = self.compile( 2025-05-07T20:31:49.9434388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9434569Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9434695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9434699Z 2025-05-07T20:31:49.9434907Z self = 2025-05-07T20:31:49.9435681Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9436182Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6ee3160>} 2025-05-07T20:31:49.9436930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9437121Z context = 2025-05-07T20:31:49.9437126Z 2025-05-07T20:31:49.9437293Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9437565Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9437770Z module_map=module_map) 2025-05-07T20:31:49.9437931Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9438028Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9438109Z E ^ 2025-05-07T20:31:49.9438458Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9438463Z 2025-05-07T20:31:49.9438879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9438886Z 2025-05-07T20:31:49.9438987Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9439208Z self=, 2025-05-07T20:31:49.9439286Z T=16384, 2025-05-07T20:31:49.9439361Z D=7168, 2025-05-07T20:31:49.9439443Z scale_ub=1200.0, 2025-05-07T20:31:49.9439531Z contiguous=False, 2025-05-07T20:31:49.9439622Z compiled=True, 2025-05-07T20:31:49.9439695Z ) 2025-05-07T20:31:49.9439915Z self = 2025-05-07T20:31:49.9440451Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.9440460Z 2025-05-07T20:31:49.9440578Z @given( 2025-05-07T20:31:49.9440738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9440867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9441031Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9441164Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9441278Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9441355Z ) 2025-05-07T20:31:49.9441600Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9441693Z def test_silu_mul_quant( 2025-05-07T20:31:49.9441773Z self, 2025-05-07T20:31:49.9441850Z T: int, 2025-05-07T20:31:49.9442077Z D: int, 2025-05-07T20:31:49.9442183Z scale_ub: Optional[float], 2025-05-07T20:31:49.9442271Z contiguous: bool, 2025-05-07T20:31:49.9442360Z compiled: bool, 2025-05-07T20:31:49.9442438Z ) -> None: 2025-05-07T20:31:49.9442531Z torch.manual_seed(2025) 2025-05-07T20:31:49.9442606Z 2025-05-07T20:31:49.9442773Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9442847Z 2025-05-07T20:31:49.9442942Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9443068Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9443155Z x = x_sign * x_clamp 2025-05-07T20:31:49.9443239Z x0 = x[:, :D] 2025-05-07T20:31:49.9443318Z x1 = x[:, D:] 2025-05-07T20:31:49.9443390Z 2025-05-07T20:31:49.9443476Z if contiguous: 2025-05-07T20:31:49.9443566Z x0 = x0.contiguous() 2025-05-07T20:31:49.9443653Z x1 = x1.contiguous() 2025-05-07T20:31:49.9443742Z 2025-05-07T20:31:49.9443831Z if scale_ub is not None: 2025-05-07T20:31:49.9443938Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9444071Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9444148Z ) 2025-05-07T20:31:49.9444229Z else: 2025-05-07T20:31:49.9444322Z scale_ub_tensor = None 2025-05-07T20:31:49.9444397Z 2025-05-07T20:31:49.9444529Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9444618Z op = silu_mul_quant 2025-05-07T20:31:49.9444702Z if compiled: 2025-05-07T20:31:49.9444805Z op = torch.compile(op) 2025-05-07T20:31:49.9444909Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9444980Z 2025-05-07T20:31:49.9445072Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9445077Z 2025-05-07T20:31:49.9445172Z moe/activation_test.py:117: 2025-05-07T20:31:49.9445306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9445533Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9445634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9446000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9446092Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.9446590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9446692Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9447048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9447272Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9447606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9447708Z kernel = self.compile( 2025-05-07T20:31:49.9448095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9448272Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9448403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9448407Z 2025-05-07T20:31:49.9448616Z self = 2025-05-07T20:31:49.9449379Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9449889Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6ee34c0>} 2025-05-07T20:31:49.9450719Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9450915Z context = 2025-05-07T20:31:49.9450919Z 2025-05-07T20:31:49.9451083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9451350Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9451460Z module_map=module_map) 2025-05-07T20:31:49.9451621Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9451722Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9451798Z E ^ 2025-05-07T20:31:49.9452149Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9452153Z 2025-05-07T20:31:49.9452583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9452588Z 2025-05-07T20:31:49.9452688Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9452914Z self=, 2025-05-07T20:31:49.9452991Z T=1, 2025-05-07T20:31:49.9453065Z D=7168, 2025-05-07T20:31:49.9453150Z scale_ub=None, 2025-05-07T20:31:49.9453234Z contiguous=False, 2025-05-07T20:31:49.9453317Z compiled=False, 2025-05-07T20:31:49.9453391Z ) 2025-05-07T20:31:49.9453608Z self = 2025-05-07T20:31:49.9453773Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.9453777Z 2025-05-07T20:31:49.9453857Z @given( 2025-05-07T20:31:49.9453973Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9454071Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9454274Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9454391Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9454507Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9454579Z ) 2025-05-07T20:31:49.9454824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9454921Z def test_silu_mul_quant( 2025-05-07T20:31:49.9454999Z self, 2025-05-07T20:31:49.9455075Z T: int, 2025-05-07T20:31:49.9455154Z D: int, 2025-05-07T20:31:49.9455252Z scale_ub: Optional[float], 2025-05-07T20:31:49.9455340Z contiguous: bool, 2025-05-07T20:31:49.9455428Z compiled: bool, 2025-05-07T20:31:49.9455505Z ) -> None: 2025-05-07T20:31:49.9455599Z torch.manual_seed(2025) 2025-05-07T20:31:49.9455675Z 2025-05-07T20:31:49.9455845Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9455927Z 2025-05-07T20:31:49.9456022Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9456147Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9456238Z x = x_sign * x_clamp 2025-05-07T20:31:49.9456317Z x0 = x[:, :D] 2025-05-07T20:31:49.9456395Z x1 = x[:, D:] 2025-05-07T20:31:49.9456469Z 2025-05-07T20:31:49.9456551Z if contiguous: 2025-05-07T20:31:49.9456641Z x0 = x0.contiguous() 2025-05-07T20:31:49.9456733Z x1 = x1.contiguous() 2025-05-07T20:31:49.9456804Z 2025-05-07T20:31:49.9456897Z if scale_ub is not None: 2025-05-07T20:31:49.9457005Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9457139Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9457216Z ) 2025-05-07T20:31:49.9457290Z else: 2025-05-07T20:31:49.9457383Z scale_ub_tensor = None 2025-05-07T20:31:49.9457459Z 2025-05-07T20:31:49.9457666Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9457763Z op = silu_mul_quant 2025-05-07T20:31:49.9457854Z if compiled: 2025-05-07T20:31:49.9457954Z op = torch.compile(op) 2025-05-07T20:31:49.9458057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9458131Z 2025-05-07T20:31:49.9458221Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9458225Z 2025-05-07T20:31:49.9458322Z moe/activation_test.py:117: 2025-05-07T20:31:49.9458456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9458555Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9458659Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9459157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9459252Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9459618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9459848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9460189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9460281Z kernel = self.compile( 2025-05-07T20:31:49.9460659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9460833Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9460958Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9460963Z 2025-05-07T20:31:49.9461227Z self = 2025-05-07T20:31:49.9462015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9462605Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd7023820>} 2025-05-07T20:31:49.9463357Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9463549Z context = 2025-05-07T20:31:49.9463553Z 2025-05-07T20:31:49.9463726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9463993Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9464098Z module_map=module_map) 2025-05-07T20:31:49.9464267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9464372Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9464449Z E ^ 2025-05-07T20:31:49.9464804Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9464808Z 2025-05-07T20:31:49.9465217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9465222Z 2025-05-07T20:31:49.9465326Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9465549Z self=, 2025-05-07T20:31:49.9465624Z T=2048, 2025-05-07T20:31:49.9465702Z D=7168, 2025-05-07T20:31:49.9465782Z scale_ub=None, 2025-05-07T20:31:49.9465866Z contiguous=False, 2025-05-07T20:31:49.9465955Z compiled=True, 2025-05-07T20:31:49.9466029Z ) 2025-05-07T20:31:49.9466324Z self = 2025-05-07T20:31:49.9466510Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.9466515Z 2025-05-07T20:31:49.9466590Z @given( 2025-05-07T20:31:49.9466708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9466816Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9466930Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9467048Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9467164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9467240Z ) 2025-05-07T20:31:49.9467495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9472239Z def test_silu_mul_quant( 2025-05-07T20:31:49.9472334Z self, 2025-05-07T20:31:49.9472412Z T: int, 2025-05-07T20:31:49.9472491Z D: int, 2025-05-07T20:31:49.9472592Z scale_ub: Optional[float], 2025-05-07T20:31:49.9472694Z contiguous: bool, 2025-05-07T20:31:49.9472782Z compiled: bool, 2025-05-07T20:31:49.9472861Z ) -> None: 2025-05-07T20:31:49.9472956Z torch.manual_seed(2025) 2025-05-07T20:31:49.9473033Z 2025-05-07T20:31:49.9473208Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9473284Z 2025-05-07T20:31:49.9473384Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9473511Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9473599Z x = x_sign * x_clamp 2025-05-07T20:31:49.9473683Z x0 = x[:, :D] 2025-05-07T20:31:49.9473763Z x1 = x[:, D:] 2025-05-07T20:31:49.9473841Z 2025-05-07T20:31:49.9473924Z if contiguous: 2025-05-07T20:31:49.9474015Z x0 = x0.contiguous() 2025-05-07T20:31:49.9474109Z x1 = x1.contiguous() 2025-05-07T20:31:49.9474185Z 2025-05-07T20:31:49.9474277Z if scale_ub is not None: 2025-05-07T20:31:49.9474388Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9474658Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9474734Z ) 2025-05-07T20:31:49.9474817Z else: 2025-05-07T20:31:49.9474910Z scale_ub_tensor = None 2025-05-07T20:31:49.9474983Z 2025-05-07T20:31:49.9475121Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9475211Z op = silu_mul_quant 2025-05-07T20:31:49.9475299Z if compiled: 2025-05-07T20:31:49.9475399Z op = torch.compile(op) 2025-05-07T20:31:49.9475504Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9475579Z 2025-05-07T20:31:49.9475671Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9475676Z 2025-05-07T20:31:49.9475773Z moe/activation_test.py:117: 2025-05-07T20:31:49.9475903Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9476004Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9476108Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9476494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9476586Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.9477090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9477185Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9477540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9477766Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9478106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9478206Z kernel = self.compile( 2025-05-07T20:31:49.9478659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9478842Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9478970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9478974Z 2025-05-07T20:31:49.9479185Z self = 2025-05-07T20:31:49.9479973Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9480477Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6fe1790>} 2025-05-07T20:31:49.9481234Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9481436Z context = 2025-05-07T20:31:49.9481441Z 2025-05-07T20:31:49.9481605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9481878Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9481986Z module_map=module_map) 2025-05-07T20:31:49.9482149Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9482254Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9482332Z E ^ 2025-05-07T20:31:49.9482689Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9482693Z 2025-05-07T20:31:49.9483106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9483189Z 2025-05-07T20:31:49.9483294Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9483519Z self=, 2025-05-07T20:31:49.9483596Z T=4096, 2025-05-07T20:31:49.9483674Z D=7168, 2025-05-07T20:31:49.9483758Z scale_ub=None, 2025-05-07T20:31:49.9483843Z contiguous=False, 2025-05-07T20:31:49.9483925Z compiled=True, 2025-05-07T20:31:49.9483999Z ) 2025-05-07T20:31:49.9484248Z self = 2025-05-07T20:31:49.9484437Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.9484444Z 2025-05-07T20:31:49.9484521Z @given( 2025-05-07T20:31:49.9484639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9484741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9484854Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9484974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9485094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9485167Z ) 2025-05-07T20:31:49.9485412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9485507Z def test_silu_mul_quant( 2025-05-07T20:31:49.9485582Z self, 2025-05-07T20:31:49.9485659Z T: int, 2025-05-07T20:31:49.9485740Z D: int, 2025-05-07T20:31:49.9485837Z scale_ub: Optional[float], 2025-05-07T20:31:49.9485926Z contiguous: bool, 2025-05-07T20:31:49.9486010Z compiled: bool, 2025-05-07T20:31:49.9486087Z ) -> None: 2025-05-07T20:31:49.9486182Z torch.manual_seed(2025) 2025-05-07T20:31:49.9486256Z 2025-05-07T20:31:49.9486424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9486500Z 2025-05-07T20:31:49.9486591Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9486715Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9486890Z x = x_sign * x_clamp 2025-05-07T20:31:49.9486977Z x0 = x[:, :D] 2025-05-07T20:31:49.9487058Z x1 = x[:, D:] 2025-05-07T20:31:49.9487133Z 2025-05-07T20:31:49.9487216Z if contiguous: 2025-05-07T20:31:49.9487310Z x0 = x0.contiguous() 2025-05-07T20:31:49.9487399Z x1 = x1.contiguous() 2025-05-07T20:31:49.9487472Z 2025-05-07T20:31:49.9487563Z if scale_ub is not None: 2025-05-07T20:31:49.9487669Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9487803Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9487881Z ) 2025-05-07T20:31:49.9487957Z else: 2025-05-07T20:31:49.9488051Z scale_ub_tensor = None 2025-05-07T20:31:49.9488130Z 2025-05-07T20:31:49.9488259Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9488348Z op = silu_mul_quant 2025-05-07T20:31:49.9488436Z if compiled: 2025-05-07T20:31:49.9488545Z op = torch.compile(op) 2025-05-07T20:31:49.9488653Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9488726Z 2025-05-07T20:31:49.9488816Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9488821Z 2025-05-07T20:31:49.9488921Z moe/activation_test.py:117: 2025-05-07T20:31:49.9489048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9489148Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9489253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9489617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9489709Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.9490212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9490307Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9490756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9490981Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9491316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9491412Z kernel = self.compile( 2025-05-07T20:31:49.9491795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9491977Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9492101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9492105Z 2025-05-07T20:31:49.9492312Z self = 2025-05-07T20:31:49.9493087Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9493589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6f6a4c0>} 2025-05-07T20:31:49.9494378Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9494571Z context = 2025-05-07T20:31:49.9494575Z 2025-05-07T20:31:49.9494743Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9495013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9495119Z module_map=module_map) 2025-05-07T20:31:49.9495363Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9495462Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9495539Z E ^ 2025-05-07T20:31:49.9495899Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9495904Z 2025-05-07T20:31:49.9496319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9496323Z 2025-05-07T20:31:49.9496429Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9496650Z self=, 2025-05-07T20:31:49.9496727Z T=16384, 2025-05-07T20:31:49.9496805Z D=5120, 2025-05-07T20:31:49.9496888Z scale_ub=1200.0, 2025-05-07T20:31:49.9496974Z contiguous=False, 2025-05-07T20:31:49.9497067Z compiled=False, 2025-05-07T20:31:49.9497139Z ) 2025-05-07T20:31:49.9497365Z self = 2025-05-07T20:31:49.9497547Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.9497552Z 2025-05-07T20:31:49.9497628Z @given( 2025-05-07T20:31:49.9497752Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9497852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9497965Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9498083Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9498195Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9498268Z ) 2025-05-07T20:31:49.9498517Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9498608Z def test_silu_mul_quant( 2025-05-07T20:31:49.9498685Z self, 2025-05-07T20:31:49.9498765Z T: int, 2025-05-07T20:31:49.9498841Z D: int, 2025-05-07T20:31:49.9498941Z scale_ub: Optional[float], 2025-05-07T20:31:49.9499112Z contiguous: bool, 2025-05-07T20:31:49.9499197Z compiled: bool, 2025-05-07T20:31:49.9499278Z ) -> None: 2025-05-07T20:31:49.9499372Z torch.manual_seed(2025) 2025-05-07T20:31:49.9499443Z 2025-05-07T20:31:49.9499616Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9499689Z 2025-05-07T20:31:49.9499779Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9499908Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9499996Z x = x_sign * x_clamp 2025-05-07T20:31:49.9500075Z x0 = x[:, :D] 2025-05-07T20:31:49.9500159Z x1 = x[:, D:] 2025-05-07T20:31:49.9500233Z 2025-05-07T20:31:49.9500315Z if contiguous: 2025-05-07T20:31:49.9500411Z x0 = x0.contiguous() 2025-05-07T20:31:49.9500499Z x1 = x1.contiguous() 2025-05-07T20:31:49.9500573Z 2025-05-07T20:31:49.9500664Z if scale_ub is not None: 2025-05-07T20:31:49.9500781Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9500919Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9500994Z ) 2025-05-07T20:31:49.9501077Z else: 2025-05-07T20:31:49.9501256Z scale_ub_tensor = None 2025-05-07T20:31:49.9501332Z 2025-05-07T20:31:49.9501462Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9501556Z op = silu_mul_quant 2025-05-07T20:31:49.9501641Z if compiled: 2025-05-07T20:31:49.9501741Z op = torch.compile(op) 2025-05-07T20:31:49.9501849Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9501924Z 2025-05-07T20:31:49.9502016Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9502021Z 2025-05-07T20:31:49.9502117Z moe/activation_test.py:117: 2025-05-07T20:31:49.9502244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9502348Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9502558Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9503063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:49.9503163Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9503526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9503754Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9504094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9504186Z kernel = self.compile( 2025-05-07T20:31:49.9504574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9504754Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9504890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9504897Z 2025-05-07T20:31:49.9505104Z self = 2025-05-07T20:31:49.9505883Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9506384Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6f6a820>} 2025-05-07T20:31:49.9507119Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9507312Z context = 2025-05-07T20:31:49.9507398Z 2025-05-07T20:31:49.9507568Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9507830Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9507938Z module_map=module_map) 2025-05-07T20:31:49.9508104Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9508201Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9508284Z E ^ 2025-05-07T20:31:49.9508634Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9508639Z 2025-05-07T20:31:49.9509049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9509054Z 2025-05-07T20:31:49.9509158Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9509384Z self=, 2025-05-07T20:31:49.9509469Z T=16384, 2025-05-07T20:31:49.9509547Z D=5120, 2025-05-07T20:31:49.9509629Z scale_ub=1200.0, 2025-05-07T20:31:49.9509714Z contiguous=True, 2025-05-07T20:31:49.9509796Z compiled=True, 2025-05-07T20:31:49.9509868Z ) 2025-05-07T20:31:49.9510086Z self = 2025-05-07T20:31:49.9510259Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.9510264Z 2025-05-07T20:31:49.9510339Z @given( 2025-05-07T20:31:49.9510464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9510563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9510676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9510796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9510910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9510988Z ) 2025-05-07T20:31:49.9511316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9511411Z def test_silu_mul_quant( 2025-05-07T20:31:49.9511490Z self, 2025-05-07T20:31:49.9511566Z T: int, 2025-05-07T20:31:49.9511642Z D: int, 2025-05-07T20:31:49.9511743Z scale_ub: Optional[float], 2025-05-07T20:31:49.9511832Z contiguous: bool, 2025-05-07T20:31:49.9511917Z compiled: bool, 2025-05-07T20:31:49.9511998Z ) -> None: 2025-05-07T20:31:49.9512093Z torch.manual_seed(2025) 2025-05-07T20:31:49.9512165Z 2025-05-07T20:31:49.9512337Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9512410Z 2025-05-07T20:31:49.9512504Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9512629Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9512716Z x = x_sign * x_clamp 2025-05-07T20:31:49.9512798Z x0 = x[:, :D] 2025-05-07T20:31:49.9512885Z x1 = x[:, D:] 2025-05-07T20:31:49.9512957Z 2025-05-07T20:31:49.9513043Z if contiguous: 2025-05-07T20:31:49.9513133Z x0 = x0.contiguous() 2025-05-07T20:31:49.9513221Z x1 = x1.contiguous() 2025-05-07T20:31:49.9513296Z 2025-05-07T20:31:49.9513388Z if scale_ub is not None: 2025-05-07T20:31:49.9513495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9513639Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9513716Z ) 2025-05-07T20:31:49.9513795Z else: 2025-05-07T20:31:49.9513887Z scale_ub_tensor = None 2025-05-07T20:31:49.9513958Z 2025-05-07T20:31:49.9514091Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9514180Z op = silu_mul_quant 2025-05-07T20:31:49.9514264Z if compiled: 2025-05-07T20:31:49.9514368Z op = torch.compile(op) 2025-05-07T20:31:49.9514474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9514632Z 2025-05-07T20:31:49.9514726Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9514731Z 2025-05-07T20:31:49.9514827Z moe/activation_test.py:117: 2025-05-07T20:31:49.9514954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9515060Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9515158Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9515525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9515618Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.9516114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9516213Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9516569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9516807Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9517149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9517244Z kernel = self.compile( 2025-05-07T20:31:49.9517630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9517803Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9517928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9517932Z 2025-05-07T20:31:49.9518139Z self = 2025-05-07T20:31:49.9518902Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9519490Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6df6e50>} 2025-05-07T20:31:49.9520238Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9520435Z context = 2025-05-07T20:31:49.9520440Z 2025-05-07T20:31:49.9520604Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9520865Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9520974Z module_map=module_map) 2025-05-07T20:31:49.9521135Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9521237Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9521324Z E ^ 2025-05-07T20:31:49.9521675Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9521679Z 2025-05-07T20:31:49.9522089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9522093Z 2025-05-07T20:31:49.9522194Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9522415Z self=, 2025-05-07T20:31:49.9522497Z T=16384, 2025-05-07T20:31:49.9522572Z D=5120, 2025-05-07T20:31:49.9522653Z scale_ub=None, 2025-05-07T20:31:49.9522743Z contiguous=False, 2025-05-07T20:31:49.9522825Z compiled=True, 2025-05-07T20:31:49.9522902Z ) 2025-05-07T20:31:49.9523117Z self = 2025-05-07T20:31:49.9523302Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.9523381Z 2025-05-07T20:31:49.9523461Z @given( 2025-05-07T20:31:49.9523578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9523675Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9523793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9523909Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9524038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9524121Z ) 2025-05-07T20:31:49.9524391Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9524487Z def test_silu_mul_quant( 2025-05-07T20:31:49.9524561Z self, 2025-05-07T20:31:49.9524637Z T: int, 2025-05-07T20:31:49.9524714Z D: int, 2025-05-07T20:31:49.9524810Z scale_ub: Optional[float], 2025-05-07T20:31:49.9524898Z contiguous: bool, 2025-05-07T20:31:49.9524987Z compiled: bool, 2025-05-07T20:31:49.9525074Z ) -> None: 2025-05-07T20:31:49.9525169Z torch.manual_seed(2025) 2025-05-07T20:31:49.9525243Z 2025-05-07T20:31:49.9525410Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9525483Z 2025-05-07T20:31:49.9525576Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9525699Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9525789Z x = x_sign * x_clamp 2025-05-07T20:31:49.9525868Z x0 = x[:, :D] 2025-05-07T20:31:49.9525946Z x1 = x[:, D:] 2025-05-07T20:31:49.9526020Z 2025-05-07T20:31:49.9526102Z if contiguous: 2025-05-07T20:31:49.9526194Z x0 = x0.contiguous() 2025-05-07T20:31:49.9526286Z x1 = x1.contiguous() 2025-05-07T20:31:49.9526357Z 2025-05-07T20:31:49.9526446Z if scale_ub is not None: 2025-05-07T20:31:49.9526552Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9526686Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9526934Z ) 2025-05-07T20:31:49.9527018Z else: 2025-05-07T20:31:49.9527115Z scale_ub_tensor = None 2025-05-07T20:31:49.9527191Z 2025-05-07T20:31:49.9527322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9527412Z op = silu_mul_quant 2025-05-07T20:31:49.9527498Z if compiled: 2025-05-07T20:31:49.9527597Z op = torch.compile(op) 2025-05-07T20:31:49.9527701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9527776Z 2025-05-07T20:31:49.9527868Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9527872Z 2025-05-07T20:31:49.9527968Z moe/activation_test.py:117: 2025-05-07T20:31:49.9528099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9528197Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9528297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9528675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9528771Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.9529263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9529358Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9529718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9529944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9530279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9530374Z kernel = self.compile( 2025-05-07T20:31:49.9530750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9530934Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9531140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9531144Z 2025-05-07T20:31:49.9531349Z self = 2025-05-07T20:31:49.9532116Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9532615Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd6d799d0>} 2025-05-07T20:31:49.9533350Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9533557Z context = 2025-05-07T20:31:49.9533561Z 2025-05-07T20:31:49.9533731Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9533998Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9534128Z module_map=module_map) 2025-05-07T20:31:49.9534316Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9534416Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9534492Z E ^ 2025-05-07T20:31:49.9534846Z E ValueError("type fp8e4nv not supported in this architecture. 
The same test body and the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')", raised from triton/compiler/compiler.py:100 while compiling _fbgemm_silu_mul_quant) were then reported verbatim for each of the following examples; only the hypothesis parameters differ:

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)

Note that the compiled=False examples fail the same way (their tracebacks go straight from activation_test.py into activation.py, without torch/_dynamo/eval_frame.py), so silu_mul_quant launches the Triton kernel unconditionally and torch.compile is not a factor.
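For reference, here is a minimal sketch that reproduces the failure outside hypothesis, using the import path and call signature visible in the tracebacks above; the shapes and the None scale upper bound are arbitrary choices for illustration.

    import torch

    # Import path taken from the traceback above.
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120  # arbitrary small example
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

    # On a GPU without fp8e4nv support this raises the same CompilationError
    # as the log; on a supported GPU it returns the fp8 tensor and its scales.
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)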
After the compilation failures, the remaining examples began failing with CUDA out-of-memory errors instead:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
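The allocation size lines up with the test's intermediates: x is a [T, 2*D] bfloat16 tensor, so torch.abs(x) needs T * 2*D * 2 bytes, and for T=16384, D=5120 that is 335,544,320 bytes, exactly the 320.00 MiB requested above. The OOM is therefore less about any single example than about the roughly 21.6 GiB of allocations that accumulated across the earlier examples in the same process. Below is a small hygiene sketch that a per-example hook could run; the helpers are hypothetical, not something the test currently does, and the error text itself also suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation.

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Hypothetical per-example cleanup: drop dead Python references,
        # then return cached blocks to CUDA so the next hypothesis example
        # starts from an empty allocator pool.
        gc.collect()
        torch.cuda.empty_cache()

    def bf16_activation_bytes(T: int, D: int) -> int:
        # Size of one [T, 2*D] bfloat16 intermediate (2 bytes per element).
        return T * 2 * D * 2

    assert bf16_activation_bytes(16384, 5120) == 320 * 1024 * 1024  # the 320.00 MiB above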
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9676967Z 2025-05-07T20:31:49.9677089Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.9677093Z 2025-05-07T20:31:49.9677193Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9677417Z self=, 2025-05-07T20:31:49.9677493Z T=4096, 2025-05-07T20:31:49.9677569Z D=7168, 2025-05-07T20:31:49.9677653Z scale_ub=1200.0, 2025-05-07T20:31:49.9677735Z contiguous=True, 2025-05-07T20:31:49.9677818Z compiled=True, 2025-05-07T20:31:49.9677895Z ) 2025-05-07T20:31:49.9678189Z self = 2025-05-07T20:31:49.9678368Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.9678373Z 2025-05-07T20:31:49.9678451Z @given( 2025-05-07T20:31:49.9678566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9678668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9678780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9678895Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9679008Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9679082Z ) 2025-05-07T20:31:49.9679325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9679421Z def test_silu_mul_quant( 2025-05-07T20:31:49.9679496Z self, 2025-05-07T20:31:49.9679572Z T: int, 2025-05-07T20:31:49.9679652Z D: int, 2025-05-07T20:31:49.9679747Z scale_ub: Optional[float], 2025-05-07T20:31:49.9679848Z contiguous: bool, 2025-05-07T20:31:49.9679935Z compiled: bool, 2025-05-07T20:31:49.9680014Z ) -> None: 2025-05-07T20:31:49.9680110Z torch.manual_seed(2025) 2025-05-07T20:31:49.9680183Z 2025-05-07T20:31:49.9680351Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9680427Z 2025-05-07T20:31:49.9680519Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9680642Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9682433Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9682519Z 2025-05-07T20:31:49.9682638Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.9682643Z 2025-05-07T20:31:49.9682747Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9682967Z self=, 2025-05-07T20:31:49.9683049Z T=16384, 2025-05-07T20:31:49.9683126Z D=7168, 2025-05-07T20:31:49.9683207Z scale_ub=None, 2025-05-07T20:31:49.9683294Z contiguous=False, 2025-05-07T20:31:49.9683378Z compiled=False, 2025-05-07T20:31:49.9683450Z ) 2025-05-07T20:31:49.9683666Z self = 2025-05-07T20:31:49.9683840Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.9683844Z 2025-05-07T20:31:49.9683920Z @given( 2025-05-07T20:31:49.9684046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9684151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9684263Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9684383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9684496Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9684571Z ) 2025-05-07T20:31:49.9684820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9684912Z def test_silu_mul_quant( 2025-05-07T20:31:49.9684990Z self, 2025-05-07T20:31:49.9685067Z T: int, 2025-05-07T20:31:49.9685143Z D: int, 2025-05-07T20:31:49.9685244Z scale_ub: Optional[float], 2025-05-07T20:31:49.9685332Z contiguous: bool, 2025-05-07T20:31:49.9685417Z compiled: bool, 2025-05-07T20:31:49.9685496Z ) -> None: 2025-05-07T20:31:49.9685591Z torch.manual_seed(2025) 2025-05-07T20:31:49.9685663Z 2025-05-07T20:31:49.9685938Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9687730Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9687740Z 2025-05-07T20:31:49.9687854Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9687858Z 2025-05-07T20:31:49.9687958Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9688181Z self=, 2025-05-07T20:31:49.9688261Z T=2048, 2025-05-07T20:31:49.9688341Z D=7168, 2025-05-07T20:31:49.9688426Z scale_ub=1200.0, 2025-05-07T20:31:49.9688508Z contiguous=True, 2025-05-07T20:31:49.9688589Z compiled=True, 2025-05-07T20:31:49.9688663Z ) 2025-05-07T20:31:49.9688877Z self = 2025-05-07T20:31:49.9689048Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.9689058Z 2025-05-07T20:31:49.9689133Z @given( 2025-05-07T20:31:49.9689248Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9689348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9689463Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9689578Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9689694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9689766Z ) 2025-05-07T20:31:49.9690020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9690191Z def test_silu_mul_quant( 2025-05-07T20:31:49.9690265Z self, 2025-05-07T20:31:49.9690343Z T: int, 2025-05-07T20:31:49.9690419Z D: int, 2025-05-07T20:31:49.9690516Z scale_ub: Optional[float], 2025-05-07T20:31:49.9690606Z contiguous: bool, 2025-05-07T20:31:49.9690691Z compiled: bool, 2025-05-07T20:31:49.9690770Z ) -> None: 2025-05-07T20:31:49.9690867Z torch.manual_seed(2025) 2025-05-07T20:31:49.9690941Z 2025-05-07T20:31:49.9691108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9691184Z 2025-05-07T20:31:49.9691274Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9691399Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9693148Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
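[Annotation] Every OutOfMemoryError in this run ends with the allocator's own suggestion, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. A sketch of how that hint plus a cache flush between Hypothesis examples might be applied; `release_cached_blocks` is a hypothetical helper, not FBGEMM code, and the env var must be set before the first CUDA allocation in the process:

```python
import os

# Set before torch initializes CUDA, e.g. in the CI job environment.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch


def release_cached_blocks() -> None:
    # Returns cached-but-unallocated blocks to the driver between test
    # examples; this cannot free tensors that are still referenced.
    torch.cuda.synchronize()
    torch.cuda.empty_cache()
    print(
        f"allocated={torch.cuda.memory_allocated() / 2**20:.0f} MiB "
        f"reserved={torch.cuda.memory_reserved() / 2**20:.0f} MiB"
    )
```

Note the messages above show ~21.6 GiB still allocated by PyTorch, so live references leaking across examples, not fragmentation alone, are the likely root cause here.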
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9693160Z 2025-05-07T20:31:49.9693276Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.9693281Z 2025-05-07T20:31:49.9693387Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9693608Z self=, 2025-05-07T20:31:49.9693686Z T=2048, 2025-05-07T20:31:49.9693760Z D=7168, 2025-05-07T20:31:49.9693844Z scale_ub=None, 2025-05-07T20:31:49.9693934Z contiguous=True, 2025-05-07T20:31:49.9694017Z compiled=False, 2025-05-07T20:31:49.9694088Z ) 2025-05-07T20:31:49.9694382Z self = 2025-05-07T20:31:49.9694563Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9694568Z 2025-05-07T20:31:49.9694644Z @given( 2025-05-07T20:31:49.9694764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9694861Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9694977Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9695092Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9695203Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9695280Z ) 2025-05-07T20:31:49.9695527Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9695619Z def test_silu_mul_quant( 2025-05-07T20:31:49.9695698Z self, 2025-05-07T20:31:49.9695773Z T: int, 2025-05-07T20:31:49.9695851Z D: int, 2025-05-07T20:31:49.9695950Z scale_ub: Optional[float], 2025-05-07T20:31:49.9696049Z contiguous: bool, 2025-05-07T20:31:49.9696134Z compiled: bool, 2025-05-07T20:31:49.9696214Z ) -> None: 2025-05-07T20:31:49.9696306Z torch.manual_seed(2025) 2025-05-07T20:31:49.9696381Z 2025-05-07T20:31:49.9696546Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9696619Z 2025-05-07T20:31:49.9696713Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.9698494Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9698581Z 2025-05-07T20:31:49.9698709Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.9698714Z 2025-05-07T20:31:49.9698813Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9699033Z self=, 2025-05-07T20:31:49.9699111Z T=1, 2025-05-07T20:31:49.9699185Z D=7168, 2025-05-07T20:31:49.9699267Z scale_ub=1200.0, 2025-05-07T20:31:49.9699358Z contiguous=True, 2025-05-07T20:31:49.9699440Z compiled=False, 2025-05-07T20:31:49.9699513Z ) 2025-05-07T20:31:49.9699730Z self = 2025-05-07T20:31:49.9699895Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.9699900Z 2025-05-07T20:31:49.9699977Z @given( 2025-05-07T20:31:49.9700093Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9700190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9700315Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9700434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9700546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9700621Z ) 2025-05-07T20:31:49.9700866Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9700963Z def test_silu_mul_quant( 2025-05-07T20:31:49.9701037Z self, 2025-05-07T20:31:49.9701112Z T: int, 2025-05-07T20:31:49.9701257Z D: int, 2025-05-07T20:31:49.9701354Z scale_ub: Optional[float], 2025-05-07T20:31:49.9701442Z contiguous: bool, 2025-05-07T20:31:49.9701531Z compiled: bool, 2025-05-07T20:31:49.9701607Z ) -> None: 2025-05-07T20:31:49.9701699Z torch.manual_seed(2025) 2025-05-07T20:31:49.9701773Z 2025-05-07T20:31:49.9701942Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9702014Z 2025-05-07T20:31:49.9702196Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9702323Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9702415Z x = x_sign * x_clamp 2025-05-07T20:31:49.9702499Z x0 = x[:, :D] 2025-05-07T20:31:49.9702578Z x1 = x[:, D:] 2025-05-07T20:31:49.9702652Z 2025-05-07T20:31:49.9702734Z if contiguous: 2025-05-07T20:31:49.9702824Z x0 = x0.contiguous() 2025-05-07T20:31:49.9702917Z x1 = x1.contiguous() 2025-05-07T20:31:49.9702988Z 2025-05-07T20:31:49.9703077Z if scale_ub is not None: 2025-05-07T20:31:49.9703185Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9703323Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9703397Z ) 2025-05-07T20:31:49.9703477Z else: 2025-05-07T20:31:49.9703570Z scale_ub_tensor = None 2025-05-07T20:31:49.9703642Z 2025-05-07T20:31:49.9703778Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9703877Z op = silu_mul_quant 2025-05-07T20:31:49.9703967Z if compiled: 2025-05-07T20:31:49.9704066Z op = torch.compile(op) 2025-05-07T20:31:49.9704170Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9704243Z 2025-05-07T20:31:49.9704333Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9704338Z 2025-05-07T20:31:49.9704434Z moe/activation_test.py:117: 2025-05-07T20:31:49.9704563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9704661Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9704760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9705270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9705367Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9705746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:49.9706052Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9706390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9706486Z kernel = self.compile( 2025-05-07T20:31:49.9706870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9707051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9707176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9707180Z 2025-05-07T20:31:49.9707382Z self = <triton.compiler.compiler.ASTSource object at 0x...> 2025-05-07T20:31:49.9708157Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9708666Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ...<lambda> at 0x7f2fd6798550>} 2025-05-07T20:31:49.9709418Z module_map = {'triton.language.extra.libdevice': <module ...>} 2025-05-07T20:31:49.9709609Z context = <...> 2025-05-07T20:31:49.9709614Z 2025-05-07T20:31:49.9709779Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9710047Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9710153Z module_map=module_map) 2025-05-07T20:31:49.9710317Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9710496Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9710576Z E ^ 2025-05-07T20:31:49.9710932Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9710936Z 2025-05-07T20:31:49.9711347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9711351Z 2025-05-07T20:31:49.9711457Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9711677Z self=, 2025-05-07T20:31:49.9711754Z T=128, 2025-05-07T20:31:49.9711832Z D=5120, 2025-05-07T20:31:49.9711913Z scale_ub=None, 2025-05-07T20:31:49.9711997Z contiguous=True, 2025-05-07T20:31:49.9712081Z compiled=False, 2025-05-07T20:31:49.9712152Z ) 2025-05-07T20:31:49.9712366Z self = 2025-05-07T20:31:49.9712555Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9712559Z 2025-05-07T20:31:49.9712635Z @given( 2025-05-07T20:31:49.9712757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9712855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9712968Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9713088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9713202Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9713274Z ) 2025-05-07T20:31:49.9713529Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9713621Z def test_silu_mul_quant( 2025-05-07T20:31:49.9713699Z self, 2025-05-07T20:31:49.9713778Z T: int, 2025-05-07T20:31:49.9713852Z D: int, 2025-05-07T20:31:49.9713949Z scale_ub: Optional[float], 2025-05-07T20:31:49.9714039Z contiguous: bool, 2025-05-07T20:31:49.9714231Z compiled: bool, 2025-05-07T20:31:49.9714311Z ) -> None: 2025-05-07T20:31:49.9714405Z torch.manual_seed(2025) 2025-05-07T20:31:49.9714481Z 2025-05-07T20:31:49.9714653Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9714727Z 2025-05-07T20:31:49.9714817Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9714942Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9715039Z x = x_sign * x_clamp 2025-05-07T20:31:49.9715120Z x0 = x[:, :D] 2025-05-07T20:31:49.9715203Z x1 = x[:, D:] 2025-05-07T20:31:49.9715276Z 2025-05-07T20:31:49.9715357Z if contiguous: 2025-05-07T20:31:49.9715454Z x0 = x0.contiguous() 2025-05-07T20:31:49.9715542Z x1 = x1.contiguous() 2025-05-07T20:31:49.9720366Z 2025-05-07T20:31:49.9720466Z if scale_ub is not None: 2025-05-07T20:31:49.9720575Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9720727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9720803Z ) 2025-05-07T20:31:49.9720881Z else: 2025-05-07T20:31:49.9720973Z scale_ub_tensor = None 2025-05-07T20:31:49.9721045Z 2025-05-07T20:31:49.9721180Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9721271Z op = silu_mul_quant 2025-05-07T20:31:49.9721359Z if compiled: 2025-05-07T20:31:49.9721466Z op = torch.compile(op) 2025-05-07T20:31:49.9721572Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9721643Z 2025-05-07T20:31:49.9721736Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9721741Z 2025-05-07T20:31:49.9721838Z moe/activation_test.py:117: 2025-05-07T20:31:49.9721971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9722071Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9722171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9722788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9722889Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9723247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:49.9723477Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9723825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9723924Z kernel = self.compile( 2025-05-07T20:31:49.9724303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9724483Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9724611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9724628Z 2025-05-07T20:31:49.9724839Z self = <triton.compiler.compiler.ASTSource object at 0x...> 2025-05-07T20:31:49.9725621Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9726122Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ...<lambda> at 0x7f2fd696b040>} 2025-05-07T20:31:49.9726866Z module_map = {'triton.language.extra.libdevice': <module ...>} 2025-05-07T20:31:49.9727059Z context = <...> 2025-05-07T20:31:49.9727063Z 2025-05-07T20:31:49.9727234Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9727584Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9727690Z module_map=module_map) 2025-05-07T20:31:49.9727854Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9727955Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9728031Z E ^ 2025-05-07T20:31:49.9728386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9728390Z 2025-05-07T20:31:49.9728798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9728803Z 2025-05-07T20:31:49.9728905Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9729130Z self=, 2025-05-07T20:31:49.9729207Z T=128, 2025-05-07T20:31:49.9729295Z D=7168, 2025-05-07T20:31:49.9729382Z scale_ub=None, 2025-05-07T20:31:49.9729467Z contiguous=True, 2025-05-07T20:31:49.9729555Z compiled=False, 2025-05-07T20:31:49.9729627Z ) 2025-05-07T20:31:49.9729842Z self = 2025-05-07T20:31:49.9730016Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9730020Z 2025-05-07T20:31:49.9730098Z @given( 2025-05-07T20:31:49.9730218Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9730321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9730434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9730549Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9730670Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9730743Z ) 2025-05-07T20:31:49.9730996Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9731178Z def test_silu_mul_quant( 2025-05-07T20:31:49.9731260Z self, 2025-05-07T20:31:49.9731342Z T: int, 2025-05-07T20:31:49.9731420Z D: int, 2025-05-07T20:31:49.9731519Z scale_ub: Optional[float], 2025-05-07T20:31:49.9731614Z contiguous: bool, 2025-05-07T20:31:49.9731699Z compiled: bool, 2025-05-07T20:31:49.9731778Z ) -> None: 2025-05-07T20:31:49.9731875Z torch.manual_seed(2025) 2025-05-07T20:31:49.9731947Z 2025-05-07T20:31:49.9732117Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9732192Z 2025-05-07T20:31:49.9732281Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9732407Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9732497Z x = x_sign * x_clamp 2025-05-07T20:31:49.9732576Z x0 = x[:, :D] 2025-05-07T20:31:49.9732657Z x1 = x[:, D:] 2025-05-07T20:31:49.9732728Z 2025-05-07T20:31:49.9732821Z if contiguous: 2025-05-07T20:31:49.9732914Z x0 = x0.contiguous() 2025-05-07T20:31:49.9733001Z x1 = x1.contiguous() 2025-05-07T20:31:49.9733074Z 2025-05-07T20:31:49.9733168Z if scale_ub is not None: 2025-05-07T20:31:49.9733271Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9733406Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9733488Z ) 2025-05-07T20:31:49.9733565Z else: 2025-05-07T20:31:49.9733661Z scale_ub_tensor = None 2025-05-07T20:31:49.9733732Z 2025-05-07T20:31:49.9733861Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9733951Z op = silu_mul_quant 2025-05-07T20:31:49.9734038Z if compiled: 2025-05-07T20:31:49.9734143Z op = torch.compile(op) 2025-05-07T20:31:49.9734271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9734354Z 2025-05-07T20:31:49.9734458Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9734547Z 2025-05-07T20:31:49.9734650Z moe/activation_test.py:117: 2025-05-07T20:31:49.9734779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9734882Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9734984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9735488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9735587Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9735944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:49.9736168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9736508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9736600Z kernel = self.compile( 2025-05-07T20:31:49.9736996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9737175Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9737300Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9737304Z 2025-05-07T20:31:49.9737510Z self = <triton.compiler.compiler.ASTSource object at 0x...> 2025-05-07T20:31:49.9738292Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9738799Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ...<lambda> at 0x7f2fd696bc10>} 2025-05-07T20:31:49.9739612Z module_map = {'triton.language.extra.libdevice': <module ...>} 2025-05-07T20:31:49.9739810Z context = <...> 2025-05-07T20:31:49.9739814Z 2025-05-07T20:31:49.9739986Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9740504Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9740619Z module_map=module_map) 2025-05-07T20:31:49.9740779Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9740880Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9740960Z E ^ 2025-05-07T20:31:49.9741364Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9741369Z 2025-05-07T20:31:49.9741801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9741814Z 2025-05-07T20:31:49.9741919Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9742139Z self=, 2025-05-07T20:31:49.9742221Z T=2048, 2025-05-07T20:31:49.9742296Z D=7168, 2025-05-07T20:31:49.9742378Z scale_ub=1200.0, 2025-05-07T20:31:49.9742465Z contiguous=True, 2025-05-07T20:31:49.9742547Z compiled=False, 2025-05-07T20:31:49.9742620Z ) 2025-05-07T20:31:49.9742840Z self = 2025-05-07T20:31:49.9743012Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.9743016Z 2025-05-07T20:31:49.9743094Z @given( 2025-05-07T20:31:49.9743211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9743308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9743571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9743687Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9743801Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9743878Z ) 2025-05-07T20:31:49.9744143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9744243Z def test_silu_mul_quant( 2025-05-07T20:31:49.9744342Z self, 2025-05-07T20:31:49.9744420Z T: int, 2025-05-07T20:31:49.9744501Z D: int, 2025-05-07T20:31:49.9744600Z scale_ub: Optional[float], 2025-05-07T20:31:49.9744689Z contiguous: bool, 2025-05-07T20:31:49.9744779Z compiled: bool, 2025-05-07T20:31:49.9744856Z ) -> None: 2025-05-07T20:31:49.9744949Z torch.manual_seed(2025) 2025-05-07T20:31:49.9745023Z 2025-05-07T20:31:49.9745191Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9746952Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
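[Annotation] For context on what the failing op computes: `silu_mul_quant` fuses a SiLU-gated multiply with quantization to FP8. A plain-PyTorch reference sketch is below; only the `(x0, x1, scale_ub)` signature and the `(y_fp8, y_scale)` return come from the log, and the rowwise e4m3 scaling scheme is an assumption rather than FBGEMM's actual algorithm:

```python
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # y = silu(x0) * x1 in float32, then rowwise FP8 quantization with an
    # optional upper bound on the pre-scale row maximum (assumed scheme).
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = torch.clamp(y / scale, -FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(1)
```

A reference like this also runs on GPUs where the Triton fp8e4nv kernel cannot compile, which would separate the numerics check from the codegen requirement seen above.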
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9746969Z 2025-05-07T20:31:49.9747089Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9747093Z 2025-05-07T20:31:49.9747194Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9747423Z self=, 2025-05-07T20:31:49.9747499Z T=1, 2025-05-07T20:31:49.9747578Z D=5120, 2025-05-07T20:31:49.9747662Z scale_ub=1200.0, 2025-05-07T20:31:49.9747747Z contiguous=True, 2025-05-07T20:31:49.9747951Z compiled=False, 2025-05-07T20:31:49.9748026Z ) 2025-05-07T20:31:49.9748241Z self = 2025-05-07T20:31:49.9748411Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.9748415Z 2025-05-07T20:31:49.9748491Z @given( 2025-05-07T20:31:49.9748607Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9748710Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9748824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9748940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9749059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9749133Z ) 2025-05-07T20:31:49.9749384Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9749476Z def test_silu_mul_quant( 2025-05-07T20:31:49.9749553Z self, 2025-05-07T20:31:49.9749642Z T: int, 2025-05-07T20:31:49.9749720Z D: int, 2025-05-07T20:31:49.9749818Z scale_ub: Optional[float], 2025-05-07T20:31:49.9749908Z contiguous: bool, 2025-05-07T20:31:49.9749992Z compiled: bool, 2025-05-07T20:31:49.9750069Z ) -> None: 2025-05-07T20:31:49.9750164Z torch.manual_seed(2025) 2025-05-07T20:31:49.9750236Z 2025-05-07T20:31:49.9750404Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9750481Z 2025-05-07T20:31:49.9750572Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9750700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9750789Z x = x_sign * x_clamp 2025-05-07T20:31:49.9750868Z x0 = x[:, :D] 2025-05-07T20:31:49.9750958Z x1 = x[:, D:] 2025-05-07T20:31:49.9751029Z 2025-05-07T20:31:49.9751111Z if contiguous: 2025-05-07T20:31:49.9751207Z x0 = x0.contiguous() 2025-05-07T20:31:49.9751297Z x1 = x1.contiguous() 2025-05-07T20:31:49.9751481Z 2025-05-07T20:31:49.9751576Z if scale_ub is not None: 2025-05-07T20:31:49.9751683Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9751820Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9751898Z ) 2025-05-07T20:31:49.9751976Z else: 2025-05-07T20:31:49.9752074Z scale_ub_tensor = None 2025-05-07T20:31:49.9752146Z 2025-05-07T20:31:49.9752275Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9752369Z op = silu_mul_quant 2025-05-07T20:31:49.9752452Z if compiled: 2025-05-07T20:31:49.9752552Z op = torch.compile(op) 2025-05-07T20:31:49.9752664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9752737Z 2025-05-07T20:31:49.9752826Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9752830Z 2025-05-07T20:31:49.9752933Z moe/activation_test.py:117: 2025-05-07T20:31:49.9753062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9753176Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9753281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9753788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9753886Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9754249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:49.9754471Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9754811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9754902Z kernel = self.compile( 2025-05-07T20:31:49.9755289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9755549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9755675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9755679Z 2025-05-07T20:31:49.9755890Z self = <triton.compiler.compiler.ASTSource object at 0x...> 2025-05-07T20:31:49.9756656Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9757154Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ...<lambda> at 0x7f2fd66ef9d0>} 2025-05-07T20:31:49.9757901Z module_map = {'triton.language.extra.libdevice': <module ...>} 2025-05-07T20:31:49.9758099Z context = <...> 2025-05-07T20:31:49.9758104Z 2025-05-07T20:31:49.9758272Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9758535Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9758641Z module_map=module_map) 2025-05-07T20:31:49.9758807Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9758907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9758982Z E ^ 2025-05-07T20:31:49.9759343Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9759348Z 2025-05-07T20:31:49.9759763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9759768Z 2025-05-07T20:31:49.9759953Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9760175Z self=, 2025-05-07T20:31:49.9760250Z T=2048, 2025-05-07T20:31:49.9760329Z D=5120, 2025-05-07T20:31:49.9760413Z scale_ub=None, 2025-05-07T20:31:49.9760497Z contiguous=True, 2025-05-07T20:31:49.9760583Z compiled=False, 2025-05-07T20:31:49.9760655Z ) 2025-05-07T20:31:49.9760869Z self = 2025-05-07T20:31:49.9761046Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9761051Z 2025-05-07T20:31:49.9761126Z @given( 2025-05-07T20:31:49.9761247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9761344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9761459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9761579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9761704Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9761779Z ) 2025-05-07T20:31:49.9762031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9762125Z def test_silu_mul_quant( 2025-05-07T20:31:49.9762205Z self, 2025-05-07T20:31:49.9762281Z T: int, 2025-05-07T20:31:49.9762355Z D: int, 2025-05-07T20:31:49.9762455Z scale_ub: Optional[float], 2025-05-07T20:31:49.9762543Z contiguous: bool, 2025-05-07T20:31:49.9762626Z compiled: bool, 2025-05-07T20:31:49.9762707Z ) -> None: 2025-05-07T20:31:49.9762800Z torch.manual_seed(2025) 2025-05-07T20:31:49.9762871Z 2025-05-07T20:31:49.9763044Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9763118Z 2025-05-07T20:31:49.9763209Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.9765100Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9765112Z 2025-05-07T20:31:49.9765231Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.9765236Z 2025-05-07T20:31:49.9765342Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9765566Z self=, 2025-05-07T20:31:49.9765648Z T=16384, 2025-05-07T20:31:49.9765724Z D=5120, 2025-05-07T20:31:49.9765806Z scale_ub=None, 2025-05-07T20:31:49.9765892Z contiguous=True, 2025-05-07T20:31:49.9765976Z compiled=False, 2025-05-07T20:31:49.9766059Z ) 2025-05-07T20:31:49.9766277Z self = 2025-05-07T20:31:49.9766451Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9766456Z 2025-05-07T20:31:49.9766532Z @given( 2025-05-07T20:31:49.9766653Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9766751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9766865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9766980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9767092Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9767170Z ) 2025-05-07T20:31:49.9767413Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9767506Z def test_silu_mul_quant( 2025-05-07T20:31:49.9767583Z self, 2025-05-07T20:31:49.9767660Z T: int, 2025-05-07T20:31:49.9767818Z D: int, 2025-05-07T20:31:49.9767919Z scale_ub: Optional[float], 2025-05-07T20:31:49.9768011Z contiguous: bool, 2025-05-07T20:31:49.9768097Z compiled: bool, 2025-05-07T20:31:49.9768180Z ) -> None: 2025-05-07T20:31:49.9768274Z torch.manual_seed(2025) 2025-05-07T20:31:49.9768349Z 2025-05-07T20:31:49.9768518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9770297Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9770310Z 2025-05-07T20:31:49.9770427Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9770431Z 2025-05-07T20:31:49.9770531Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9770756Z self=, 2025-05-07T20:31:49.9770832Z T=4096, 2025-05-07T20:31:49.9770907Z D=5120, 2025-05-07T20:31:49.9770990Z scale_ub=None, 2025-05-07T20:31:49.9771072Z contiguous=True, 2025-05-07T20:31:49.9771156Z compiled=False, 2025-05-07T20:31:49.9771233Z ) 2025-05-07T20:31:49.9771451Z self = 2025-05-07T20:31:49.9771624Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9771629Z 2025-05-07T20:31:49.9771704Z @given( 2025-05-07T20:31:49.9771821Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9771922Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9772117Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9772236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9772358Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9772430Z ) 2025-05-07T20:31:49.9772677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9772772Z def test_silu_mul_quant( 2025-05-07T20:31:49.9772847Z self, 2025-05-07T20:31:49.9772926Z T: int, 2025-05-07T20:31:49.9773000Z D: int, 2025-05-07T20:31:49.9773096Z scale_ub: Optional[float], 2025-05-07T20:31:49.9773185Z contiguous: bool, 2025-05-07T20:31:49.9773269Z compiled: bool, 2025-05-07T20:31:49.9773347Z ) -> None: 2025-05-07T20:31:49.9773442Z torch.manual_seed(2025) 2025-05-07T20:31:49.9773514Z 2025-05-07T20:31:49.9773680Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9775466Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
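[Annotation] The requested allocation sizes reported in these OutOfMemoryErrors line up exactly with the test's input tensor: `x = torch.randn([T, 2 * D], dtype=torch.bfloat16)` needs `T * 2D * 2` bytes. A quick arithmetic check against the sizes printed above:

```python
def x_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
    # Size in MiB of x = torch.randn([T, 2 * D], dtype=torch.bfloat16);
    # bfloat16 is 2 bytes per element.
    return T * (2 * D) * bytes_per_elem / 2**20


assert x_mib(16384, 7168) == 448.0  # the 448.00 MiB requests
assert x_mib(16384, 5120) == 320.0  # the 320.00 MiB requests
assert x_mib(4096, 7168) == 112.0   # the 112.00 MiB requests
assert x_mib(4096, 5120) == 80.0    # the 80.00 MiB request above
assert x_mib(2048, 7168) == 56.0    # the 56.00 MiB requests
assert x_mib(2048, 5120) == 40.0    # the 40.00 MiB requests
```

So each individual example needs well under 1 GiB for its inputs; the failures occur only because ~21.7 GiB of the 22.07 GiB card is already held when the example starts.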
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9775472Z 2025-05-07T20:31:49.9775587Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9775591Z 2025-05-07T20:31:49.9775694Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9775914Z self=, 2025-05-07T20:31:49.9775991Z T=2048, 2025-05-07T20:31:49.9776065Z D=5120, 2025-05-07T20:31:49.9776146Z scale_ub=None, 2025-05-07T20:31:49.9776316Z contiguous=False, 2025-05-07T20:31:49.9776400Z compiled=False, 2025-05-07T20:31:49.9776472Z ) 2025-05-07T20:31:49.9776690Z self = 2025-05-07T20:31:49.9776862Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.9776866Z 2025-05-07T20:31:49.9776940Z @given( 2025-05-07T20:31:49.9777058Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9777155Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9777271Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9777386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9777496Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9777575Z ) 2025-05-07T20:31:49.9777823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9777915Z def test_silu_mul_quant( 2025-05-07T20:31:49.9777999Z self, 2025-05-07T20:31:49.9778078Z T: int, 2025-05-07T20:31:49.9778154Z D: int, 2025-05-07T20:31:49.9778254Z scale_ub: Optional[float], 2025-05-07T20:31:49.9778343Z contiguous: bool, 2025-05-07T20:31:49.9778428Z compiled: bool, 2025-05-07T20:31:49.9778510Z ) -> None: 2025-05-07T20:31:49.9778601Z torch.manual_seed(2025) 2025-05-07T20:31:49.9778676Z 2025-05-07T20:31:49.9778844Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9780685Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9780699Z 2025-05-07T20:31:49.9780817Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9780821Z 2025-05-07T20:31:49.9780921Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9781211Z self=, 2025-05-07T20:31:49.9781289Z T=4096, 2025-05-07T20:31:49.9781363Z D=7168, 2025-05-07T20:31:49.9781449Z scale_ub=None, 2025-05-07T20:31:49.9781532Z contiguous=True, 2025-05-07T20:31:49.9781613Z compiled=True, 2025-05-07T20:31:49.9781689Z ) 2025-05-07T20:31:49.9781904Z self = 2025-05-07T20:31:49.9782074Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.9782079Z 2025-05-07T20:31:49.9782154Z @given( 2025-05-07T20:31:49.9782271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9782384Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9782499Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9782613Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9782728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9782800Z ) 2025-05-07T20:31:49.9783050Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9783142Z def test_silu_mul_quant( 2025-05-07T20:31:49.9783217Z self, 2025-05-07T20:31:49.9783294Z T: int, 2025-05-07T20:31:49.9783369Z D: int, 2025-05-07T20:31:49.9783467Z scale_ub: Optional[float], 2025-05-07T20:31:49.9783559Z contiguous: bool, 2025-05-07T20:31:49.9783644Z compiled: bool, 2025-05-07T20:31:49.9783722Z ) -> None: 2025-05-07T20:31:49.9783818Z torch.manual_seed(2025) 2025-05-07T20:31:49.9783894Z 2025-05-07T20:31:49.9784085Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9785979Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
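[Annotation] The `compiled` parameter in these examples switches the op between eager execution and `torch.compile` inside the test's `fn()`. A standalone sketch of the same toggle; `maybe_compile` is an illustrative name, not FBGEMM code:

```python
from typing import Callable

import torch


def maybe_compile(op: Callable, compiled: bool) -> Callable:
    # torch.compile returns a wrapper that traces and optimizes the op on
    # its first call; with compiled=False the eager op is used unchanged.
    return torch.compile(op) if compiled else op


silu = maybe_compile(torch.nn.functional.silu, compiled=True)
```

This is why `compiled=True` examples can fail in two ways at once here: the wrapper's first call both allocates compile-time workspace and triggers the Triton fp8 codegen path.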
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9785985Z 2025-05-07T20:31:49.9786100Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9786104Z 2025-05-07T20:31:49.9786208Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9786426Z self=, 2025-05-07T20:31:49.9786503Z T=2048, 2025-05-07T20:31:49.9786583Z D=5120, 2025-05-07T20:31:49.9786668Z scale_ub=1200.0, 2025-05-07T20:31:49.9786755Z contiguous=False, 2025-05-07T20:31:49.9786837Z compiled=False, 2025-05-07T20:31:49.9786908Z ) 2025-05-07T20:31:49.9787126Z self = 2025-05-07T20:31:49.9787298Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.9787303Z 2025-05-07T20:31:49.9787379Z @given( 2025-05-07T20:31:49.9787497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9787595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9787709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9787824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9787937Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9788014Z ) 2025-05-07T20:31:49.9788262Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9788439Z def test_silu_mul_quant( 2025-05-07T20:31:49.9788519Z self, 2025-05-07T20:31:49.9788597Z T: int, 2025-05-07T20:31:49.9788675Z D: int, 2025-05-07T20:31:49.9788784Z scale_ub: Optional[float], 2025-05-07T20:31:49.9788871Z contiguous: bool, 2025-05-07T20:31:49.9788955Z compiled: bool, 2025-05-07T20:31:49.9789034Z ) -> None: 2025-05-07T20:31:49.9789128Z torch.manual_seed(2025) 2025-05-07T20:31:49.9789202Z 2025-05-07T20:31:49.9789368Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9791144Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9791158Z 2025-05-07T20:31:49.9791275Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9791280Z 2025-05-07T20:31:49.9791380Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9791601Z self=, 2025-05-07T20:31:49.9791678Z T=4096, 2025-05-07T20:31:49.9791754Z D=7168, 2025-05-07T20:31:49.9791837Z scale_ub=1200.0, 2025-05-07T20:31:49.9791920Z contiguous=True, 2025-05-07T20:31:49.9792002Z compiled=False, 2025-05-07T20:31:49.9792076Z ) 2025-05-07T20:31:49.9792290Z self = 2025-05-07T20:31:49.9792466Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.9792471Z 2025-05-07T20:31:49.9792702Z @given( 2025-05-07T20:31:49.9792829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9792929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9793042Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9793156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9793270Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9793343Z ) 2025-05-07T20:31:49.9793597Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9793690Z def test_silu_mul_quant( 2025-05-07T20:31:49.9793765Z self, 2025-05-07T20:31:49.9793848Z T: int, 2025-05-07T20:31:49.9793923Z D: int, 2025-05-07T20:31:49.9794019Z scale_ub: Optional[float], 2025-05-07T20:31:49.9794113Z contiguous: bool, 2025-05-07T20:31:49.9794197Z compiled: bool, 2025-05-07T20:31:49.9794275Z ) -> None: 2025-05-07T20:31:49.9794373Z torch.manual_seed(2025) 2025-05-07T20:31:49.9794457Z 2025-05-07T20:31:49.9794623Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9796373Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9796379Z 2025-05-07T20:31:49.9796494Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9796499Z 2025-05-07T20:31:49.9796601Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9796825Z self=, 2025-05-07T20:31:49.9796987Z T=16384, 2025-05-07T20:31:49.9797065Z D=7168, 2025-05-07T20:31:49.9797145Z scale_ub=None, 2025-05-07T20:31:49.9797234Z contiguous=False, 2025-05-07T20:31:49.9797317Z compiled=True, 2025-05-07T20:31:49.9797390Z ) 2025-05-07T20:31:49.9797606Z self = 2025-05-07T20:31:49.9797780Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.9797785Z 2025-05-07T20:31:49.9797860Z @given( 2025-05-07T20:31:49.9797977Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9798073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9798189Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9798305Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9798417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9798492Z ) 2025-05-07T20:31:49.9798744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9798843Z def test_silu_mul_quant( 2025-05-07T20:31:49.9798922Z self, 2025-05-07T20:31:49.9798998Z T: int, 2025-05-07T20:31:49.9799074Z D: int, 2025-05-07T20:31:49.9799176Z scale_ub: Optional[float], 2025-05-07T20:31:49.9799263Z contiguous: bool, 2025-05-07T20:31:49.9799347Z compiled: bool, 2025-05-07T20:31:49.9799428Z ) -> None: 2025-05-07T20:31:49.9799521Z torch.manual_seed(2025) 2025-05-07T20:31:49.9799594Z 2025-05-07T20:31:49.9799758Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9801540Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
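[Annotation] The repeated "Trying example:" blocks are Hypothesis drawing parameters from fixed grids via `st.sampled_from` under `@settings(verbosity=Verbosity.verbose, ...)`, which is why the full test source is echoed for every draw. A self-contained sketch of the same pattern; `_MAX_SAMPLES = 20` is assumed here, since the constant's value never appears in the log:

```python
import hypothesis.strategies as st
from hypothesis import Verbosity, given, settings

_MAX_SAMPLES = 20  # assumed value; not shown in the log


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
def test_shapes(T: int, D: int) -> None:
    # Hypothesis runs this body once per drawn example; verbose verbosity
    # prints "Trying example: ..." before each call, as seen above.
    assert T * D > 0
```

Because state persists on the GPU across draws, an early leak makes every subsequent large-`T` example fail, which matches the cascade of OOMs in this run.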
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9801627Z 2025-05-07T20:31:49.9801748Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9801752Z 2025-05-07T20:31:49.9801852Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9802077Z self=, 2025-05-07T20:31:49.9802154Z T=4096, 2025-05-07T20:31:49.9802229Z D=7168, 2025-05-07T20:31:49.9802312Z scale_ub=None, 2025-05-07T20:31:49.9802396Z contiguous=True, 2025-05-07T20:31:49.9802479Z compiled=False, 2025-05-07T20:31:49.9802555Z ) 2025-05-07T20:31:49.9802770Z self = 2025-05-07T20:31:49.9802947Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9802965Z 2025-05-07T20:31:49.9803041Z @given( 2025-05-07T20:31:49.9803159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9803258Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9803370Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9803484Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9803597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9803670Z ) 2025-05-07T20:31:49.9803922Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9804014Z def test_silu_mul_quant( 2025-05-07T20:31:49.9804091Z self, 2025-05-07T20:31:49.9804187Z T: int, 2025-05-07T20:31:49.9804269Z D: int, 2025-05-07T20:31:49.9804385Z scale_ub: Optional[float], 2025-05-07T20:31:49.9804483Z contiguous: bool, 2025-05-07T20:31:49.9804571Z compiled: bool, 2025-05-07T20:31:49.9804653Z ) -> None: 2025-05-07T20:31:49.9804823Z torch.manual_seed(2025) 2025-05-07T20:31:49.9804897Z 2025-05-07T20:31:49.9805063Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9806814Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9806819Z 2025-05-07T20:31:49.9806937Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9806945Z 2025-05-07T20:31:49.9807044Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9807275Z self=, 2025-05-07T20:31:49.9807354Z T=16384, 2025-05-07T20:31:49.9807429Z D=7168, 2025-05-07T20:31:49.9807509Z scale_ub=None, 2025-05-07T20:31:49.9807595Z contiguous=True, 2025-05-07T20:31:49.9807678Z compiled=False, 2025-05-07T20:31:49.9807749Z ) 2025-05-07T20:31:49.9807970Z self = 2025-05-07T20:31:49.9808147Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.9808152Z 2025-05-07T20:31:49.9808227Z @given( 2025-05-07T20:31:49.9808348Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9808445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9808559Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9808673Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9808785Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9808942Z ) 2025-05-07T20:31:49.9809191Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9809283Z def test_silu_mul_quant( 2025-05-07T20:31:49.9809362Z self, 2025-05-07T20:31:49.9809437Z T: int, 2025-05-07T20:31:49.9809512Z D: int, 2025-05-07T20:31:49.9809610Z scale_ub: Optional[float], 2025-05-07T20:31:49.9809696Z contiguous: bool, 2025-05-07T20:31:49.9809780Z compiled: bool, 2025-05-07T20:31:49.9809860Z ) -> None: 2025-05-07T20:31:49.9809953Z torch.manual_seed(2025) 2025-05-07T20:31:49.9810029Z 2025-05-07T20:31:49.9810197Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9811982Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9811997Z 2025-05-07T20:31:49.9812112Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9812117Z 2025-05-07T20:31:49.9812217Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9812441Z self=, 2025-05-07T20:31:49.9812517Z T=16384, 2025-05-07T20:31:49.9812591Z D=7168, 2025-05-07T20:31:49.9812678Z scale_ub=1200.0, 2025-05-07T20:31:49.9812764Z contiguous=True, 2025-05-07T20:31:49.9812846Z compiled=False, 2025-05-07T20:31:49.9812922Z ) 2025-05-07T20:31:49.9813135Z self = 2025-05-07T20:31:49.9813425Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.9813430Z 2025-05-07T20:31:49.9813505Z @given( 2025-05-07T20:31:49.9813622Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9813724Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9813836Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9813950Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9814065Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9814148Z ) 2025-05-07T20:31:49.9814442Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9814536Z def test_silu_mul_quant( 2025-05-07T20:31:49.9814612Z self, 2025-05-07T20:31:49.9814689Z T: int, 2025-05-07T20:31:49.9814763Z D: int, 2025-05-07T20:31:49.9814859Z scale_ub: Optional[float], 2025-05-07T20:31:49.9814948Z contiguous: bool, 2025-05-07T20:31:49.9815046Z compiled: bool, 2025-05-07T20:31:49.9815122Z ) -> None: 2025-05-07T20:31:49.9815218Z torch.manual_seed(2025) 2025-05-07T20:31:49.9815290Z 2025-05-07T20:31:49.9815458Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9817240Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
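The allocation sizes in these OOM reports line up exactly with the test's first allocation, x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16): T * 2D bfloat16 elements at 2 bytes each. A quick sanity check in plain Python, using the T/D values from the failing examples above:

    # Size of x = torch.randn([T, 2 * D], dtype=torch.bfloat16) in MiB
    def bf16_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20  # element count times 2 bytes per bfloat16

    print(bf16_mib(4096, 7168))   # 112.0 -> "Tried to allocate 112.00 MiB"
    print(bf16_mib(16384, 7168))  # 448.0 -> "Tried to allocate 448.00 MiB"

So the failures are not allocator mystery overhead: the device (22.07 GiB total, ~30 MiB free at this point) simply has no room left for the example's input tensor.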
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9817245Z 2025-05-07T20:31:49.9817362Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9817449Z 2025-05-07T20:31:49.9817556Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9817775Z self=, 2025-05-07T20:31:49.9817853Z T=128, 2025-05-07T20:31:49.9817929Z D=5120, 2025-05-07T20:31:49.9818012Z scale_ub=1200.0, 2025-05-07T20:31:49.9818099Z contiguous=False, 2025-05-07T20:31:49.9818181Z compiled=False, 2025-05-07T20:31:49.9818252Z ) 2025-05-07T20:31:49.9818475Z self = 2025-05-07T20:31:49.9818647Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.9818652Z 2025-05-07T20:31:49.9818727Z @given( 2025-05-07T20:31:49.9818848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9818946Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9819066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9819181Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9819302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9819378Z ) 2025-05-07T20:31:49.9819621Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9819715Z def test_silu_mul_quant( 2025-05-07T20:31:49.9819795Z self, 2025-05-07T20:31:49.9819869Z T: int, 2025-05-07T20:31:49.9819943Z D: int, 2025-05-07T20:31:49.9820046Z scale_ub: Optional[float], 2025-05-07T20:31:49.9820134Z contiguous: bool, 2025-05-07T20:31:49.9820219Z compiled: bool, 2025-05-07T20:31:49.9820297Z ) -> None: 2025-05-07T20:31:49.9820389Z torch.manual_seed(2025) 2025-05-07T20:31:49.9820465Z 2025-05-07T20:31:49.9820630Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9820703Z 2025-05-07T20:31:49.9820796Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9820921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9821091Z x = x_sign * x_clamp 2025-05-07T20:31:49.9821225Z x0 = x[:, :D] 2025-05-07T20:31:49.9821305Z x1 = x[:, D:] 2025-05-07T20:31:49.9821377Z 2025-05-07T20:31:49.9821462Z if contiguous: 2025-05-07T20:31:49.9821554Z x0 = x0.contiguous() 2025-05-07T20:31:49.9821643Z x1 = x1.contiguous() 2025-05-07T20:31:49.9821719Z 2025-05-07T20:31:49.9821807Z if scale_ub is not None: 2025-05-07T20:31:49.9821914Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9822050Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9822125Z ) 2025-05-07T20:31:49.9822204Z else: 2025-05-07T20:31:49.9822297Z scale_ub_tensor = None 2025-05-07T20:31:49.9822368Z 2025-05-07T20:31:49.9822502Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9822592Z op = silu_mul_quant 2025-05-07T20:31:49.9822678Z if compiled: 2025-05-07T20:31:49.9822795Z op = torch.compile(op) 2025-05-07T20:31:49.9822899Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9822970Z 2025-05-07T20:31:49.9823061Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9823066Z 2025-05-07T20:31:49.9823162Z moe/activation_test.py:117: 2025-05-07T20:31:49.9823291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9823391Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9823490Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9823992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9824087Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9824443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9824670Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9825101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9825196Z kernel = self.compile( 2025-05-07T20:31:49.9825575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9825751Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9825879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9825883Z 2025-05-07T20:31:49.9826091Z self = 2025-05-07T20:31:49.9826871Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9827377Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd65c6670>} 2025-05-07T20:31:49.9828130Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9828324Z context = 2025-05-07T20:31:49.9828329Z 2025-05-07T20:31:49.9828493Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9828767Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9828874Z module_map=module_map) 2025-05-07T20:31:49.9829037Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9829138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9829213Z E ^ 2025-05-07T20:31:49.9829648Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9829656Z 2025-05-07T20:31:49.9830073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9830077Z 2025-05-07T20:31:49.9830181Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9830406Z self=, 2025-05-07T20:31:49.9830486Z T=2048, 2025-05-07T20:31:49.9830562Z D=7168, 2025-05-07T20:31:49.9830647Z scale_ub=None, 2025-05-07T20:31:49.9830733Z contiguous=False, 2025-05-07T20:31:49.9830814Z compiled=False, 2025-05-07T20:31:49.9830888Z ) 2025-05-07T20:31:49.9831105Z self = 2025-05-07T20:31:49.9831285Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.9831294Z 2025-05-07T20:31:49.9831374Z @given( 2025-05-07T20:31:49.9831493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9831596Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9831709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9831822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9831939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9832011Z ) 2025-05-07T20:31:49.9832263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9832355Z def test_silu_mul_quant( 2025-05-07T20:31:49.9832431Z self, 2025-05-07T20:31:49.9832511Z T: int, 2025-05-07T20:31:49.9832586Z D: int, 2025-05-07T20:31:49.9832681Z scale_ub: Optional[float], 2025-05-07T20:31:49.9832773Z contiguous: bool, 2025-05-07T20:31:49.9832857Z compiled: bool, 2025-05-07T20:31:49.9832933Z ) -> None: 2025-05-07T20:31:49.9833035Z torch.manual_seed(2025) 2025-05-07T20:31:49.9833208Z 2025-05-07T20:31:49.9833377Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9835128Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
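Each of these OOM messages suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation. A minimal sketch of applying that inside a test process, assuming the variable is set before the first CUDA allocation (once the allocator has initialized, the setting is too late):

    import os
    # Configure the caching allocator before torch touches CUDA.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch
    x = torch.randn([2048, 2 * 7168], device="cuda", dtype=torch.bfloat16)

Equivalently, the variable can be exported in the job environment before invoking pytest. Note, though, that in this run the reserved-but-unallocated figure is only a few MiB, so fragmentation is unlikely to be the real culprit here; the device is simply full.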
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9835134Z 2025-05-07T20:31:49.9835251Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9835258Z 2025-05-07T20:31:49.9835358Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9835587Z self=, 2025-05-07T20:31:49.9835676Z T=128, 2025-05-07T20:31:49.9835753Z D=7168, 2025-05-07T20:31:49.9835836Z scale_ub=1200.0, 2025-05-07T20:31:49.9835927Z contiguous=True, 2025-05-07T20:31:49.9836008Z compiled=True, 2025-05-07T20:31:49.9836080Z ) 2025-05-07T20:31:49.9836299Z self = 2025-05-07T20:31:49.9836466Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.9836471Z 2025-05-07T20:31:49.9836545Z @given( 2025-05-07T20:31:49.9836663Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9836761Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9836875Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9836991Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9837102Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9837180Z ) 2025-05-07T20:31:49.9837511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9837606Z def test_silu_mul_quant( 2025-05-07T20:31:49.9837687Z self, 2025-05-07T20:31:49.9837764Z T: int, 2025-05-07T20:31:49.9837838Z D: int, 2025-05-07T20:31:49.9837939Z scale_ub: Optional[float], 2025-05-07T20:31:49.9838026Z contiguous: bool, 2025-05-07T20:31:49.9838112Z compiled: bool, 2025-05-07T20:31:49.9838190Z ) -> None: 2025-05-07T20:31:49.9838282Z torch.manual_seed(2025) 2025-05-07T20:31:49.9838356Z 2025-05-07T20:31:49.9838524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9838599Z 2025-05-07T20:31:49.9838692Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9838817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9838905Z x = x_sign * x_clamp 2025-05-07T20:31:49.9838987Z x0 = x[:, :D] 2025-05-07T20:31:49.9839076Z x1 = x[:, D:] 2025-05-07T20:31:49.9839147Z 2025-05-07T20:31:49.9839234Z if contiguous: 2025-05-07T20:31:49.9839325Z x0 = x0.contiguous() 2025-05-07T20:31:49.9839414Z x1 = x1.contiguous() 2025-05-07T20:31:49.9839489Z 2025-05-07T20:31:49.9839578Z if scale_ub is not None: 2025-05-07T20:31:49.9839684Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.9839818Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.9839892Z ) 2025-05-07T20:31:49.9839970Z else: 2025-05-07T20:31:49.9840196Z scale_ub_tensor = None 2025-05-07T20:31:49.9840270Z 2025-05-07T20:31:49.9840405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.9840493Z op = silu_mul_quant 2025-05-07T20:31:49.9840578Z if compiled: 2025-05-07T20:31:49.9840680Z op = torch.compile(op) 2025-05-07T20:31:49.9840790Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9841018Z 2025-05-07T20:31:49.9841113Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.9841118Z 2025-05-07T20:31:49.9841214Z moe/activation_test.py:117: 2025-05-07T20:31:49.9841346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9841449Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.9841553Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.9846476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.9846588Z return fn(*args, **kwargs) 2025-05-07T20:31:49.9847094Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.9847192Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.9847557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.9847793Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.9848136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.9848231Z kernel = self.compile( 2025-05-07T20:31:49.9848609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.9848784Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9848918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.9848924Z 2025-05-07T20:31:49.9849135Z self = 2025-05-07T20:31:49.9849905Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.9850552Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2fd659b5e0>} 2025-05-07T20:31:49.9851305Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.9851495Z context = 2025-05-07T20:31:49.9851500Z 2025-05-07T20:31:49.9851666Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.9851930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9852036Z module_map=module_map) 2025-05-07T20:31:49.9852197Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9852302Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9852389Z E ^ 2025-05-07T20:31:49.9852746Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.9852751Z 2025-05-07T20:31:49.9853158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9853162Z 2025-05-07T20:31:49.9853263Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9853489Z self=, 2025-05-07T20:31:49.9853564Z T=128, 2025-05-07T20:31:49.9853641Z D=7168, 2025-05-07T20:31:49.9853726Z scale_ub=1200.0, 2025-05-07T20:31:49.9853808Z contiguous=True, 2025-05-07T20:31:49.9853892Z compiled=False, 2025-05-07T20:31:49.9853963Z ) 2025-05-07T20:31:49.9854178Z self = 2025-05-07T20:31:49.9854359Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.9854443Z 2025-05-07T20:31:49.9854520Z @given( 2025-05-07T20:31:49.9854637Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9854738Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9854852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9854968Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9855080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9855153Z ) 2025-05-07T20:31:49.9855404Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9855497Z def test_silu_mul_quant( 2025-05-07T20:31:49.9855572Z self, 2025-05-07T20:31:49.9855655Z T: int, 2025-05-07T20:31:49.9855731Z D: int, 2025-05-07T20:31:49.9855830Z scale_ub: Optional[float], 2025-05-07T20:31:49.9855925Z contiguous: bool, 2025-05-07T20:31:49.9856009Z compiled: bool, 2025-05-07T20:31:49.9856097Z ) -> None: 2025-05-07T20:31:49.9856192Z torch.manual_seed(2025) 2025-05-07T20:31:49.9856268Z 2025-05-07T20:31:49.9856442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9856516Z 2025-05-07T20:31:49.9856607Z x_sign = torch.sign(x) 2025-05-07T20:31:49.9856735Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.9858482Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
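By this point the process already holds 22.05 GiB of a 22.07 GiB device, so even the 20 MiB intermediate produced by torch.clamp(torch.abs(x), 0.01, 2.0) fails: memory from earlier examples is evidently still held across Hypothesis examples. A hedged sketch of a cleanup step that could run between examples — where to hook it into this test class is an assumption on my part, not shown in the log:

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references first, then return cached blocks
        # to the driver so the next example starts from a cleaner allocator
        # state. This frees cached memory only, not tensors still referenced.
        gc.collect()
        torch.cuda.empty_cache()
        mib = torch.cuda.memory_allocated() / 2**20
        print(f"still allocated after cleanup: {mib:.1f} MiB")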
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9858493Z 2025-05-07T20:31:49.9858688Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.9858693Z 2025-05-07T20:31:49.9858794Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9859016Z self=, 2025-05-07T20:31:49.9859098Z T=128, 2025-05-07T20:31:49.9859173Z D=5120, 2025-05-07T20:31:49.9859253Z scale_ub=1200.0, 2025-05-07T20:31:49.9859344Z contiguous=True, 2025-05-07T20:31:49.9859425Z compiled=True, 2025-05-07T20:31:49.9859500Z ) 2025-05-07T20:31:49.9859716Z self = 2025-05-07T20:31:49.9859883Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.9859887Z 2025-05-07T20:31:49.9859965Z @given( 2025-05-07T20:31:49.9860082Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9860179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9860300Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9860420Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9860534Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9860610Z ) 2025-05-07T20:31:49.9860853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9860947Z def test_silu_mul_quant( 2025-05-07T20:31:49.9861022Z self, 2025-05-07T20:31:49.9861097Z T: int, 2025-05-07T20:31:49.9861239Z D: int, 2025-05-07T20:31:49.9861338Z scale_ub: Optional[float], 2025-05-07T20:31:49.9861425Z contiguous: bool, 2025-05-07T20:31:49.9861512Z compiled: bool, 2025-05-07T20:31:49.9861588Z ) -> None: 2025-05-07T20:31:49.9861681Z torch.manual_seed(2025) 2025-05-07T20:31:49.9861757Z 2025-05-07T20:31:49.9861923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9861997Z 2025-05-07T20:31:49.9862095Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.9863946Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9863952Z 2025-05-07T20:31:49.9864074Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.9864079Z 2025-05-07T20:31:49.9864179Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.9864403Z self=, 2025-05-07T20:31:49.9864480Z T=128, 2025-05-07T20:31:49.9864554Z D=7168, 2025-05-07T20:31:49.9864645Z scale_ub=None, 2025-05-07T20:31:49.9864729Z contiguous=True, 2025-05-07T20:31:49.9864811Z compiled=True, 2025-05-07T20:31:49.9864885Z ) 2025-05-07T20:31:49.9865099Z self = 2025-05-07T20:31:49.9865264Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.9865272Z 2025-05-07T20:31:49.9865349Z @given( 2025-05-07T20:31:49.9865467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.9865567Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.9865679Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.9865793Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.9865906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.9865980Z ) 2025-05-07T20:31:49.9866223Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.9866319Z def test_silu_mul_quant( 2025-05-07T20:31:49.9866476Z self, 2025-05-07T20:31:49.9866557Z T: int, 2025-05-07T20:31:49.9866635Z D: int, 2025-05-07T20:31:49.9866730Z scale_ub: Optional[float], 2025-05-07T20:31:49.9866820Z contiguous: bool, 2025-05-07T20:31:49.9866906Z compiled: bool, 2025-05-07T20:31:49.9866983Z ) -> None: 2025-05-07T20:31:49.9867080Z torch.manual_seed(2025) 2025-05-07T20:31:49.9867153Z 2025-05-07T20:31:49.9867319Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.9869061Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.9869071Z 2025-05-07T20:31:49.9869189Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.9869325Z =============================== warnings summary =============================== 2025-05-07T20:31:49.9869636Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:49.9869932Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:49.9870228Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:49.9871108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:31:49.9871417Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:31:49.9871422Z 2025-05-07T20:31:49.9871602Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:49.9872888Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:49.9873078Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:49.9873082Z 2025-05-07T20:31:49.9873292Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:49.9873459Z ================== 1 failed, 1 passed, 13 warnings in 33.14s =================== 2025-05-07T20:31:51.7274847Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:51.7914319Z 2025-05-07T20:31:51.7915251Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:51.7915640Z 2025-05-07T20:31:51.7915645Z 2025-05-07T20:31:51.7934714Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:53.9547036Z ============================= test session starts ============================== 2025-05-07T20:31:53.9547701Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:53.9548236Z cachedir: .pytest_cache 2025-05-07T20:31:53.9549177Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:53.9549946Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:53.9550369Z plugins: hypothesis-6.131.14 2025-05-07T20:31:55.5644852Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:55.7770395Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:31:55.7770807Z run-last-failure: rerun previous 1 failure 2025-05-07T20:31:55.7771043Z 2025-05-07T20:31:57.9873717Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:57.9874804Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:57.9876198Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:57.9877656Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:57.9879049Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:57.9880437Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.9881756Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:57.9883427Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.9884857Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:57.9886120Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:57.9887360Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:57.9888599Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:57.9889643Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:57.9890684Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:57.9891921Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:57.9893224Z W0507 20:31:57.985997 87502 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:57.9894522Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:57.9895583Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:57.9896788Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:57.9898157Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:57.9899236Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.9900151Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.9900908Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:57.9902043Z W0507 20:31:57.985997 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.0046018Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:58.0047156Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:58.0048528Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:58.0050456Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:58.0052035Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:58.0053435Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.0054764Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:58.0056163Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.0057639Z W0507 20:31:58.004048 87502 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:58.0058891Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:58.0060120Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:58.0061667Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:58.0062718Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:58.0063742Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:58.0064979Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:58.0066258Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:58.0067389Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:58.0068448Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:58.0069615Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:58.0070982Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:58.0072055Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.0072983Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.0073819Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:58.0074865Z W0507 20:31:58.004048 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.6534342Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.6535302Z self=, 2025-05-07T20:31:58.6535907Z T=1, 2025-05-07T20:31:58.6536104Z D=5120, 2025-05-07T20:31:58.6536318Z scale_ub=None, 2025-05-07T20:31:58.6536546Z contiguous=True, 2025-05-07T20:31:58.6536777Z compiled=True, 2025-05-07T20:31:58.6537000Z ) 2025-05-07T20:31:58.6537338Z self = 2025-05-07T20:31:58.6537871Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:58.6538141Z 2025-05-07T20:31:58.6538224Z @given( 2025-05-07T20:31:58.6538470Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.6538794Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.6539106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.6539446Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.6539788Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.6540468Z ) 2025-05-07T20:31:58.6540923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.6541486Z def test_silu_mul_quant( 2025-05-07T20:31:58.6541742Z self, 2025-05-07T20:31:58.6541940Z T: int, 2025-05-07T20:31:58.6542147Z D: int, 2025-05-07T20:31:58.6542379Z scale_ub: Optional[float], 2025-05-07T20:31:58.6542656Z contiguous: bool, 2025-05-07T20:31:58.6543330Z compiled: bool, 2025-05-07T20:31:58.6543579Z ) -> None: 2025-05-07T20:31:58.6543809Z torch.manual_seed(2025) 2025-05-07T20:31:58.6544060Z 2025-05-07T20:31:58.6544347Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.6544702Z 2025-05-07T20:31:58.6544900Z x_sign = torch.sign(x) 2025-05-07T20:31:58.6545203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.6545527Z x = x_sign * x_clamp 2025-05-07T20:31:58.6545775Z x0 = x[:, :D] 2025-05-07T20:31:58.6546004Z x1 = x[:, D:] 2025-05-07T20:31:58.6546223Z 2025-05-07T20:31:58.6546416Z if contiguous: 2025-05-07T20:31:58.6546671Z x0 = x0.contiguous() 2025-05-07T20:31:58.6546982Z x1 = x1.contiguous() 2025-05-07T20:31:58.6547229Z 2025-05-07T20:31:58.6547438Z if scale_ub is not None: 2025-05-07T20:31:58.6547728Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.6548089Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.6548407Z ) 2025-05-07T20:31:58.6548611Z else: 2025-05-07T20:31:58.6548839Z scale_ub_tensor = None 2025-05-07T20:31:58.6549097Z 2025-05-07T20:31:58.6549345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.6549674Z op = silu_mul_quant 2025-05-07T20:31:58.6549940Z if compiled: 2025-05-07T20:31:58.6550209Z op = torch.compile(op) 2025-05-07T20:31:58.6550523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.6550808Z 2025-05-07T20:31:58.6551013Z y_fp8, y_scale = fn() 2025-05-07T20:31:58.6551314Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:58.6551610Z 2025-05-07T20:31:58.6551860Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.6552210Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:58.6552509Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:58.6553016Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:58.6553392Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.6553716Z 2025-05-07T20:31:58.6553923Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:58.6554128Z 2025-05-07T20:31:58.6554234Z moe/activation_test.py:126: 2025-05-07T20:31:58.6554547Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.6554890Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:58.6555229Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.6556030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:58.6556805Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:58.6557358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.6558061Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.6558762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:58.6559494Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.6560258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:58.6561014Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.6561746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:58.6562389Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:58.6563001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:58.6563625Z fn() 2025-05-07T20:31:58.6564141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:58.6564724Z self.fn.run( 2025-05-07T20:31:58.6565199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.6565745Z kernel = self.compile( 2025-05-07T20:31:58.6566290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.6566990Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.6567411Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.6567654Z 2025-05-07T20:31:58.6567874Z self = 2025-05-07T20:31:58.6568964Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.6570370Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7edee7040>} 2025-05-07T20:31:58.6571726Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.6572754Z context = 2025-05-07T20:31:58.6573048Z 2025-05-07T20:31:58.6573238Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.6573763Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.6574254Z module_map=module_map) 2025-05-07T20:31:58.6574728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.6575097Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:58.6575383Z E ^ 2025-05-07T20:31:58.6575857Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.6576309Z 2025-05-07T20:31:58.6576749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.6577304Z 2025-05-07T20:31:58.6577413Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.6577847Z self=, 2025-05-07T20:31:58.6578261Z T=2048, 2025-05-07T20:31:58.6578454Z D=5120, 2025-05-07T20:31:58.6578658Z scale_ub=1200.0, 2025-05-07T20:31:58.6578894Z contiguous=True, 2025-05-07T20:31:58.6579125Z compiled=False, 2025-05-07T20:31:58.6579343Z ) 2025-05-07T20:31:59.7052647Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.7053757Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:59.7055100Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.7056562Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.7058267Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.7059671Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.7060988Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.7062471Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.7063906Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.7065175Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:59.7066412Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.7067627Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:59.7068682Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:59.7069713Z W0507 20:31:59.700780 87502 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:59.7071109Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.7072407Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.7073538Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:59.7074598Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:59.7075791Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.7077176Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.7078249Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.7079182Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.7079946Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:59.7080991Z W0507 20:31:59.700780 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.9382561Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.9383674Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:59.9385035Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.9386494Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.9387903Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.9389317Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.9390654Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.9392059Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.9393501Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.9394933Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:59.9396177Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.9397398Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:59.9398464Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:59.9399510Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:59.9400764Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.9402083Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.9403229Z W0507 
20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:59.9404304Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:59.9405507Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.9406981Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.9408075Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.9409001Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.9409768Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:59.9410806Z W0507 20:31:59.934271 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.7832019Z self = 2025-05-07T20:32:00.7832914Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.7833332Z 2025-05-07T20:32:00.7833443Z @given( 2025-05-07T20:32:00.7833768Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.7834199Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.7834539Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.7834897Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.7835241Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.7835549Z ) 2025-05-07T20:32:00.7835923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.7836374Z def test_silu_mul_quant( 2025-05-07T20:32:00.7836633Z self, 2025-05-07T20:32:00.7836844Z T: int, 2025-05-07T20:32:00.7837050Z D: int, 2025-05-07T20:32:00.7837288Z scale_ub: Optional[float], 2025-05-07T20:32:00.7837606Z contiguous: bool, 2025-05-07T20:32:00.7837883Z compiled: bool, 2025-05-07T20:32:00.7838557Z ) -> None: 2025-05-07T20:32:00.7838789Z torch.manual_seed(2025) 2025-05-07T20:32:00.7839051Z 2025-05-07T20:32:00.7839332Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.7839690Z 2025-05-07T20:32:00.7839900Z x_sign = torch.sign(x) 2025-05-07T20:32:00.7840396Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.7840727Z x = x_sign * x_clamp 2025-05-07T20:32:00.7840983Z x0 = x[:, :D] 2025-05-07T20:32:00.7841205Z x1 = x[:, D:] 2025-05-07T20:32:00.7841427Z 2025-05-07T20:32:00.7841629Z if contiguous: 2025-05-07T20:32:00.7841867Z x0 = x0.contiguous() 2025-05-07T20:32:00.7842141Z x1 = x1.contiguous() 2025-05-07T20:32:00.7842394Z 2025-05-07T20:32:00.7842591Z if scale_ub is not None: 2025-05-07T20:32:00.7842878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.7843234Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.7843568Z ) 2025-05-07T20:32:00.7843768Z else: 2025-05-07T20:32:00.7843993Z scale_ub_tensor = None 
2025-05-07T20:32:00.7844260Z 2025-05-07T20:32:00.7844499Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.7844845Z op = silu_mul_quant 2025-05-07T20:32:00.7845119Z if compiled: 2025-05-07T20:32:00.7845385Z op = torch.compile(op) 2025-05-07T20:32:00.7845691Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.7855756Z 2025-05-07T20:32:00.7855988Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.7856180Z 2025-05-07T20:32:00.7856288Z moe/activation_test.py:117: 2025-05-07T20:32:00.7856606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.7856948Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.7857246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.7858158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.7858868Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.7859430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.7860129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.7860809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.7861433Z kernel = self.compile( 2025-05-07T20:32:00.7861995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.7862661Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.7863068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.7863319Z 2025-05-07T20:32:00.7863540Z self = 2025-05-07T20:32:00.7864632Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.7866113Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7cbe6e5e0>} 2025-05-07T20:32:00.7867489Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.7868532Z context = 2025-05-07T20:32:00.7868827Z 2025-05-07T20:32:00.7869006Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.7869676Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.7870161Z module_map=module_map) 2025-05-07T20:32:00.7870535Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.7870910Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.7871186Z E ^ 2025-05-07T20:32:00.7871660Z E ValueError("type fp8e4nv not supported in this architecture. 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

(test body as in the listing above; under compiled=True the failure surfaces in the reference path instead)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, ..., backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
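Note: ref_fn only reaches Triton in its last step, the row-wise FP8 quantization. For orientation,
a plain-PyTorch sketch of per-row quantization consistent with how the test dequantizes
(y_fp8.to(torch.float32) * y_scale[:, None]); FP8_MAX, EPS, and the clamping details are
assumptions, not the actual triton_quantize_fp8_row implementation:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = 448.0  # assumption: torch.finfo(torch.float8_e4m3fn).max
    EPS = 1e-12      # assumption: guards all-zero rows against divide-by-zero

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude, optionally clamped to the provided upper bound.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        # One scale per row so each row spans the FP8 range; dequantization
        # multiplies the scale back, matching the test's check above.
        y_scale = torch.clamp(row_max, min=EPS) / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale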
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)

[2025-05-07T20:32:01] W0507 20:32:01.406810 and W0507 20:32:01.581908 [0/2]: the
identify_mutated_tensors warning and traceback repeat verbatim twice more, each ending in the same
_fbgemm_silu_mul_quant CompilationError.

[2025-05-07T20:32:02] The T=16384 example then fails at "y_fp8, y_scale = fn()" with the same
_fbgemm_silu_mul_quant CompilationError; the pytest dump (test body and traceback) is identical to
the first failing example above.
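Note: every "Trying example" block re-prints the whole test body because the suite runs Hypothesis
at Verbosity.verbose (visible in the @settings line above), which echoes each drawn example before
executing it. A minimal sketch of that behavior, with a placeholder _MAX_SAMPLES (the real constant
is defined in activation_test.py):

    from hypothesis import Verbosity, given, settings, strategies as st

    _MAX_SAMPLES = 16  # placeholder value; activation_test.py defines its own

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_demo(T: int) -> None:
        # At Verbosity.verbose, Hypothesis prints "Trying example: test_demo(T=...)"
        # for each draw; that is what fills this section of the log.
        assert T >= 1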
The remaining drawn examples fail the same way; only the parameters and the failing kernel differ:

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() via triton_quantize_fp8_row -> _kernel_quantize_fp8_row, same CompilationError
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> [20:32:03-20:32:04] two more identify_mutated_tensors warnings ([0/3], W0507 20:32:03.407715
       and W0507 20:32:04.080095) repeat the traceback above; then fails in fn() ->
       _fbgemm_silu_mul_quant, same CompilationError
Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fails in fn() -> _fbgemm_silu_mul_quant, same CompilationError
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> fails in ref_fn() via triton_quantize_fp8_row -> _kernel_quantize_fp8_row, same CompilationError
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fails in fn() -> _fbgemm_silu_mul_quant:
       E   triton.compiler.errors.CompilationError: at 1:0:
       E   def _fbgemm_silu_mul_quant(
       E   ^
       E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8851261Z 2025-05-07T20:32:05.8851810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8852332Z 2025-05-07T20:32:05.8852439Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8852863Z self=, 2025-05-07T20:32:05.8853270Z T=4096, 2025-05-07T20:32:05.8853462Z D=5120, 2025-05-07T20:32:05.8853663Z scale_ub=1200.0, 2025-05-07T20:32:05.8853899Z contiguous=True, 2025-05-07T20:32:05.8854122Z compiled=False, 2025-05-07T20:32:05.8854338Z ) 2025-05-07T20:32:05.8854670Z self = 2025-05-07T20:32:05.8855177Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8855457Z 2025-05-07T20:32:05.8855539Z @given( 2025-05-07T20:32:05.8855777Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8856105Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8856427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8856771Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8857107Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8857396Z ) 2025-05-07T20:32:05.8857749Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8858197Z def test_silu_mul_quant( 2025-05-07T20:32:05.8858449Z self, 2025-05-07T20:32:05.8858643Z T: int, 2025-05-07T20:32:05.8858844Z D: int, 2025-05-07T20:32:05.8859071Z scale_ub: Optional[float], 2025-05-07T20:32:05.8859344Z contiguous: bool, 2025-05-07T20:32:05.8859596Z compiled: bool, 2025-05-07T20:32:05.8859878Z ) -> None: 2025-05-07T20:32:05.8860106Z torch.manual_seed(2025) 2025-05-07T20:32:05.8860364Z 2025-05-07T20:32:05.8860646Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8860989Z 2025-05-07T20:32:05.8861350Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8861651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8861968Z x = x_sign * x_clamp 2025-05-07T20:32:05.8862221Z x0 = x[:, :D] 2025-05-07T20:32:05.8862450Z x1 = x[:, D:] 2025-05-07T20:32:05.8862657Z 2025-05-07T20:32:05.8862848Z if contiguous: 2025-05-07T20:32:05.8863099Z x0 = x0.contiguous() 2025-05-07T20:32:05.8863354Z x1 = x1.contiguous() 2025-05-07T20:32:05.8863609Z 2025-05-07T20:32:05.8863808Z if scale_ub is not None: 2025-05-07T20:32:05.8864089Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8864426Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8864743Z ) 2025-05-07T20:32:05.8864941Z else: 2025-05-07T20:32:05.8865155Z scale_ub_tensor = None 2025-05-07T20:32:05.8865416Z 2025-05-07T20:32:05.8865648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8865982Z op = silu_mul_quant 2025-05-07T20:32:05.8866245Z if compiled: 2025-05-07T20:32:05.8866508Z op = torch.compile(op) 2025-05-07T20:32:05.8866812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8867099Z 2025-05-07T20:32:05.8867297Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8867469Z 2025-05-07T20:32:05.8867572Z moe/activation_test.py:117: 2025-05-07T20:32:05.8867875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8868217Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8868507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8869205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8869948Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8870582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8871282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8871952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8872488Z kernel = self.compile( 2025-05-07T20:32:05.8873041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8873694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8874098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8874330Z 2025-05-07T20:32:05.8874545Z self = 2025-05-07T20:32:05.8875636Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8877009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ebcb6430>} 2025-05-07T20:32:05.8878367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8879384Z context = 2025-05-07T20:32:05.8879674Z 2025-05-07T20:32:05.8879852Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8880381Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8880859Z module_map=module_map) 2025-05-07T20:32:05.8881239Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8881678Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8881941Z E ^ 2025-05-07T20:32:05.8882409Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8882857Z 2025-05-07T20:32:05.8883280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8883793Z 2025-05-07T20:32:05.8883904Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8884320Z self=, 2025-05-07T20:32:05.8884731Z T=1, 2025-05-07T20:32:05.8884924Z D=5120, 2025-05-07T20:32:05.8885115Z scale_ub=None, 2025-05-07T20:32:05.8885335Z contiguous=True, 2025-05-07T20:32:05.8885568Z compiled=True, 2025-05-07T20:32:05.8885772Z ) 2025-05-07T20:32:06.4006845Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.4008998Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:06.4010392Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.4011817Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.4013462Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.4014860Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.4016160Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.4017540Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.4018954Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.4020211Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:06.4021623Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.4022846Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:06.4023892Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:06.4024906Z W0507 20:32:06.396558 87502 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:06.4026124Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.4027565Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.4028692Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:06.4029724Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:06.4030909Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.4032275Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.4033353Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.4034267Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.4035018Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:06.4036050Z W0507 20:32:06.396558 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.5887288Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.5888557Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:06.5889966Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.5891408Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.5892798Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.5894197Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.5895513Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.5896907Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.5898318Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.5899571Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:06.5900931Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.5902240Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:06.5903281Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:06.5904314Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:06.5905544Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.5906849Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.5907980Z W0507 
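[editor's note] Every failure above has the same root cause. Triton maps torch.float8_e4m3fn to its fp8e4nv type, which NVIDIA GPUs only support natively from compute capability 8.9 (Ada/Hopper) onward; this job runs on a g5.4xlarge, whose A10G reports capability (8, 6), so TTIR generation raises the ValueError naming fp8e4b15 and fp8e5 as the only available fp8 dtypes. A minimal sketch of a capability guard follows; supports_fp8e4nv and Fp8KernelTest are illustrative names, not part of the FBGEMM test suite:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ (Ada/Hopper);
        # the A10G in a g5.4xlarge reports (8, 6), hence the errors above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class Fp8KernelTest(unittest.TestCase):
        ...

With a guard like this the whole class would be reported as skipped on SM 8.6 runners instead of failing in the Triton compiler on every Hypothesis example.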
20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:06.5909037Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:06.5910219Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.5911666Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.5912739Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.5913667Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.5914415Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:06.5915429Z W0507 20:32:06.584789 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.0872821Z self = 2025-05-07T20:32:07.0873417Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:07.0873815Z 2025-05-07T20:32:07.0873951Z @given( 2025-05-07T20:32:07.0874281Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.0874733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.0875201Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.0875550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.0875888Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.0876186Z ) 2025-05-07T20:32:07.0876552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.0877005Z def test_silu_mul_quant( 2025-05-07T20:32:07.0877260Z self, 2025-05-07T20:32:07.0877465Z T: int, 2025-05-07T20:32:07.0877671Z D: int, 2025-05-07T20:32:07.0877893Z scale_ub: Optional[float], 2025-05-07T20:32:07.0878175Z contiguous: bool, 2025-05-07T20:32:07.0878426Z compiled: bool, 2025-05-07T20:32:07.0878655Z ) -> None: 2025-05-07T20:32:07.0878891Z torch.manual_seed(2025) 2025-05-07T20:32:07.0879335Z 2025-05-07T20:32:07.0879614Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.0879973Z 2025-05-07T20:32:07.0880179Z x_sign = torch.sign(x) 2025-05-07T20:32:07.0880478Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.0880803Z x = x_sign * x_clamp 2025-05-07T20:32:07.0881061Z x0 = x[:, :D] 2025-05-07T20:32:07.0881285Z x1 = x[:, D:] 2025-05-07T20:32:07.0881501Z 2025-05-07T20:32:07.0881699Z if contiguous: 2025-05-07T20:32:07.0881937Z x0 = x0.contiguous() 2025-05-07T20:32:07.0882210Z x1 = x1.contiguous() 2025-05-07T20:32:07.0882461Z 2025-05-07T20:32:07.0882654Z if scale_ub is not None: 2025-05-07T20:32:07.0882945Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.0883295Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.0883618Z ) 2025-05-07T20:32:07.0883831Z else: 2025-05-07T20:32:07.0884053Z scale_ub_tensor = None 
2025-05-07T20:32:07.0884315Z 2025-05-07T20:32:07.0884556Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.0884888Z op = silu_mul_quant 2025-05-07T20:32:07.0885155Z if compiled: 2025-05-07T20:32:07.0885410Z op = torch.compile(op) 2025-05-07T20:32:07.0885722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.0886011Z 2025-05-07T20:32:07.0886209Z y_fp8, y_scale = fn() 2025-05-07T20:32:07.0886513Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:07.0886817Z 2025-05-07T20:32:07.0887065Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.0887415Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:07.0887743Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:07.0888076Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:07.0888583Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.0888909Z 2025-05-07T20:32:07.0889124Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:07.0889325Z 2025-05-07T20:32:07.0889438Z moe/activation_test.py:126: 2025-05-07T20:32:07.0889749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.0890099Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:07.0890439Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.0891233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:07.0891994Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:07.0892554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.0893248Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.0893961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:07.0894691Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.0895453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:07.0896198Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.0896939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:07.0897588Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:07.0898199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:07.0898718Z fn() 2025-05-07T20:32:07.0899238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:07.0899917Z self.fn.run( 2025-05-07T20:32:07.0900392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.0900932Z kernel = self.compile( 2025-05-07T20:32:07.0901553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.0902214Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.0902619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.0902860Z 2025-05-07T20:32:07.0903074Z self = 2025-05-07T20:32:07.0904181Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.0905569Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ebcb6940>} 2025-05-07T20:32:07.0906915Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.0907930Z context = 2025-05-07T20:32:07.0908230Z 2025-05-07T20:32:07.0908402Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.0908940Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.0909414Z module_map=module_map) 2025-05-07T20:32:07.0909791Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.0910674Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:07.0910964Z E ^ 2025-05-07T20:32:07.0911437Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.0911897Z 2025-05-07T20:32:07.0912316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.0912834Z 2025-05-07T20:32:07.0912942Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.0913368Z self=, 2025-05-07T20:32:07.0913775Z T=2048, 2025-05-07T20:32:07.0913978Z D=5120, 2025-05-07T20:32:07.0914183Z scale_ub=None, 2025-05-07T20:32:07.0914403Z contiguous=True, 2025-05-07T20:32:07.0914639Z compiled=True, 2025-05-07T20:32:07.0914861Z ) 2025-05-07T20:32:07.5584892Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.5586494Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:07.5588193Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.5589645Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.5591019Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.5592628Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.5593941Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.5595312Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.5596733Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.5597984Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:07.5599212Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.5600486Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:07.5601536Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:07.5602562Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:07.5603897Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.5605521Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.5606922Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:07.5608218Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:07.5609699Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.5611404Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.5612734Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.5613864Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.5614784Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:07.5616066Z W0507 20:32:07.554478 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.7459606Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.7461362Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:07.7462712Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.7464132Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.7465538Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.7466935Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.7468267Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.7469665Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.7471095Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.7472355Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:07.7473757Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.7474988Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:07.7476041Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:07.7477072Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:07.7478317Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.7479618Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.7480743Z W0507 
20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:07.7481799Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:07.7482983Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.7484354Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.7485513Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.7486426Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.7487168Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:07.7488213Z W0507 20:32:07.741908 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.2433764Z self = 2025-05-07T20:32:08.2434580Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:08.2434964Z 2025-05-07T20:32:08.2435083Z @given( 2025-05-07T20:32:08.2435431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.2435897Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.2436286Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.2436625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.2436958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.2437253Z ) 2025-05-07T20:32:08.2437615Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.2438061Z def test_silu_mul_quant( 2025-05-07T20:32:08.2438314Z self, 2025-05-07T20:32:08.2438517Z T: int, 2025-05-07T20:32:08.2438718Z D: int, 2025-05-07T20:32:08.2438945Z scale_ub: Optional[float], 2025-05-07T20:32:08.2439228Z contiguous: bool, 2025-05-07T20:32:08.2439471Z compiled: bool, 2025-05-07T20:32:08.2439710Z ) -> None: 2025-05-07T20:32:08.2439939Z torch.manual_seed(2025) 2025-05-07T20:32:08.2440542Z 2025-05-07T20:32:08.2440836Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.2441192Z 2025-05-07T20:32:08.2441392Z x_sign = torch.sign(x) 2025-05-07T20:32:08.2441697Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.2442021Z x = x_sign * x_clamp 2025-05-07T20:32:08.2442274Z x0 = x[:, :D] 2025-05-07T20:32:08.2442494Z x1 = x[:, D:] 2025-05-07T20:32:08.2442712Z 2025-05-07T20:32:08.2442913Z if contiguous: 2025-05-07T20:32:08.2443148Z x0 = x0.contiguous() 2025-05-07T20:32:08.2443418Z x1 = x1.contiguous() 2025-05-07T20:32:08.2443671Z 2025-05-07T20:32:08.2443869Z if scale_ub is not None: 2025-05-07T20:32:08.2444153Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.2444498Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.2444812Z ) 2025-05-07T20:32:08.2445014Z else: 2025-05-07T20:32:08.2445246Z scale_ub_tensor = None 
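[editor's note] For reference, the test's ref_fn computes the SiLU-mul in fp32 (y = x0 * sigmoid(x0) * x1) and then quantizes row-wise to FP8. Below is a hedged pure-PyTorch sketch of that rowwise quantization under the semantics the test implies: per-row scale = max(|row|) / FP8_MAX, optionally capped by scale_ub, with y_scale being the dequantization scale used as y_fp8.to(torch.float32) * y_scale[:, None]. The function name quantize_fp8_row_reference is illustrative, and the real triton_quantize_fp8_row kernel may differ in details such as zero handling or scale dtype:

    from typing import Optional, Tuple
    import torch

    FP8_DTYPE = torch.float8_e4m3fn
    FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3fn

    def quantize_fp8_row_reference(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs, optionally clamped from above by scale_ub.
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Dequantization scale: y ~= y_fp8.to(torch.float32) * y_scale[:, None].
        y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).to(FP8_DTYPE)
        return y_fp8, y_scale

Running a CPU/PyTorch reference like this would sidestep Triton entirely, which is why the failures only appear once the test reaches the Triton-backed paths.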
2025-05-07T20:32:08.2445501Z 2025-05-07T20:32:08.2445739Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.2446064Z op = silu_mul_quant 2025-05-07T20:32:08.2446327Z if compiled: 2025-05-07T20:32:08.2446585Z op = torch.compile(op) 2025-05-07T20:32:08.2446890Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.2447175Z 2025-05-07T20:32:08.2447368Z y_fp8, y_scale = fn() 2025-05-07T20:32:08.2455212Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:08.2455561Z 2025-05-07T20:32:08.2455817Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.2456177Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:08.2456493Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:08.2456824Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:08.2457205Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.2457698Z 2025-05-07T20:32:08.2457915Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:08.2458122Z 2025-05-07T20:32:08.2458231Z moe/activation_test.py:126: 2025-05-07T20:32:08.2458555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.2458917Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:08.2459256Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.2460065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:08.2460826Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:08.2461492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.2462183Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.2462896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:08.2463637Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.2464406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:08.2465151Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.2465897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:08.2466553Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:08.2467165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:08.2467693Z fn() 2025-05-07T20:32:08.2468291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:08.2468889Z self.fn.run( 2025-05-07T20:32:08.2469366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.2469930Z kernel = self.compile( 2025-05-07T20:32:08.2470518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.2471183Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.2471591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.2471835Z 2025-05-07T20:32:08.2472051Z self = 2025-05-07T20:32:08.2473133Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.2474534Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7eb6519d0>} 2025-05-07T20:32:08.2475882Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.2476916Z context = 2025-05-07T20:32:08.2477217Z 2025-05-07T20:32:08.2477392Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.2477931Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.2478398Z module_map=module_map) 2025-05-07T20:32:08.2478784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.2479161Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:08.2479526Z E ^ 2025-05-07T20:32:08.2480042Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.2480503Z 2025-05-07T20:32:08.2480921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.2481431Z 2025-05-07T20:32:08.2481544Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.2481962Z self=, 2025-05-07T20:32:08.2482378Z T=128, 2025-05-07T20:32:08.2482582Z D=5120, 2025-05-07T20:32:08.2482785Z scale_ub=None, 2025-05-07T20:32:08.2483010Z contiguous=True, 2025-05-07T20:32:08.2483253Z compiled=True, 2025-05-07T20:32:08.2483471Z ) 2025-05-07T20:32:08.7721574Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.7722970Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:08.7724319Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.7725764Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.7727159Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.7728729Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.7730065Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.7731469Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.7732885Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.7734141Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:08.7735389Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.7736612Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:08.7737657Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:08.7738686Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:08.7739935Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.7741719Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.7742857Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:08.7743913Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:08.7745112Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.7746474Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.7747547Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.7748480Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.7749238Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:08.7750325Z W0507 20:32:08.768148 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.9605960Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.9607424Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:08.9608779Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.9610256Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.9611646Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.9613036Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.9614331Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.9615710Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.9617124Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.9618364Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:08.9619706Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.9620928Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:08.9622053Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:08.9623084Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:08.9624303Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.9625606Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.9626711Z W0507 
20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:08.9627769Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:08.9628943Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.9630355Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.9631546Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.9632465Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.9633219Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:08.9634246Z W0507 20:32:08.956670 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.7796067Z self = 2025-05-07T20:32:09.7796836Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:09.7797194Z 2025-05-07T20:32:09.7797305Z @given( 2025-05-07T20:32:09.7797655Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.7798001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.7798316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.7798667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.7799017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.7799316Z ) 2025-05-07T20:32:09.7799686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.7800150Z def test_silu_mul_quant( 2025-05-07T20:32:09.7800414Z self, 2025-05-07T20:32:09.7800617Z T: int, 2025-05-07T20:32:09.7800834Z D: int, 2025-05-07T20:32:09.7801070Z scale_ub: Optional[float], 2025-05-07T20:32:09.7801351Z contiguous: bool, 2025-05-07T20:32:09.7801607Z compiled: bool, 2025-05-07T20:32:09.7801843Z ) -> None: 2025-05-07T20:32:09.7802063Z torch.manual_seed(2025) 2025-05-07T20:32:09.7802313Z 2025-05-07T20:32:09.7802785Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.7803131Z 2025-05-07T20:32:09.7803333Z x_sign = torch.sign(x) 2025-05-07T20:32:09.7803633Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.7803952Z x = x_sign * x_clamp 2025-05-07T20:32:09.7804206Z x0 = x[:, :D] 2025-05-07T20:32:09.7804434Z x1 = x[:, D:] 2025-05-07T20:32:09.7804646Z 2025-05-07T20:32:09.7804843Z if contiguous: 2025-05-07T20:32:09.7805091Z x0 = x0.contiguous() 2025-05-07T20:32:09.7805354Z x1 = x1.contiguous() 2025-05-07T20:32:09.7805606Z 2025-05-07T20:32:09.7805806Z if scale_ub is not None: 2025-05-07T20:32:09.7806087Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.7806424Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.7806744Z ) 2025-05-07T20:32:09.7806943Z else: 2025-05-07T20:32:09.7807165Z scale_ub_tensor = None 
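[editor's note] The repeated W0507 "Encountered an exception in identify_mutated_tensors, assuming every input is mutated" blocks are a downstream symptom, not a separate bug: when torch.compile traces a user-defined Triton kernel it regenerates the kernel's TTIR (generate_ttir -> src.make_ir) to work out which arguments the kernel mutates, and that regeneration hits the same fp8e4nv CompilationError, so Dynamo falls back to treating all inputs as mutated and logs the traceback at warning level. A hedged sketch of an architecture-aware dtype choice follows; pick_triton_fp8_dtype is an illustrative helper, not FBGEMM API:

    import torch

    def pick_triton_fp8_dtype() -> torch.dtype:
        # fp8e4nv <-> torch.float8_e4m3fn needs SM 8.9+; fp8e5 <->
        # torch.float8_e5m2 is one of the two types the ValueError in this
        # log lists as supported on the A10G.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2

Whether e5m2's reduced mantissa precision is acceptable for these kernels is a separate accuracy question; the sketch only shows how the hard compile failure could be avoided on pre-Ada parts.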
2025-05-07T20:32:09.7807424Z 2025-05-07T20:32:09.7807665Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.7807983Z op = silu_mul_quant 2025-05-07T20:32:09.7808241Z if compiled: 2025-05-07T20:32:09.7808499Z op = torch.compile(op) 2025-05-07T20:32:09.7808797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.7809084Z 2025-05-07T20:32:09.7809286Z y_fp8, y_scale = fn() 2025-05-07T20:32:09.7809578Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:09.7809878Z 2025-05-07T20:32:09.7810126Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.7810466Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:09.7810769Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:09.7811098Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:09.7811469Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.7811969Z 2025-05-07T20:32:09.7812189Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:09.7812389Z 2025-05-07T20:32:09.7812501Z moe/activation_test.py:126: 2025-05-07T20:32:09.7812801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.7813147Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:09.7813485Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.7814283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:09.7815035Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:09.7815590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.7816278Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.7816971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:09.7817709Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.7818468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:09.7819214Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.7819942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:09.7820584Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:09.7821274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:09.7821799Z fn() 2025-05-07T20:32:09.7822311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:09.7823003Z self.fn.run( 2025-05-07T20:32:09.7823477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.7824008Z kernel = self.compile( 2025-05-07T20:32:09.7824553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.7825215Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.7825624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.7825859Z 2025-05-07T20:32:09.7826070Z self = 2025-05-07T20:32:09.7827164Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.7828549Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7eb9bb550>} 2025-05-07T20:32:09.7829905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.7830928Z context = 2025-05-07T20:32:09.7831223Z 2025-05-07T20:32:09.7831395Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.7831927Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.7832403Z module_map=module_map) 2025-05-07T20:32:09.7832774Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.7833144Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:09.7833509Z E ^ 2025-05-07T20:32:09.7834054Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.7834615Z 2025-05-07T20:32:09.7835121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.7835754Z 2025-05-07T20:32:09.7835865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.7836347Z self=, 2025-05-07T20:32:09.7836814Z T=4096, 2025-05-07T20:32:09.7837019Z D=5120, 2025-05-07T20:32:09.7837234Z scale_ub=None, 2025-05-07T20:32:09.7837463Z contiguous=True, 2025-05-07T20:32:09.7837709Z compiled=True, 2025-05-07T20:32:09.7837936Z ) 2025-05-07T20:32:10.3095927Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:10.3098122Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:10.3100811Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:10.3102409Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:10.3103814Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:10.3105230Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.3106728Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:10.3108134Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.3109577Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:10.3110832Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:10.3112062Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:10.3113277Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:10.3114339Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:10.3115369Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:10.3116704Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:10.3118004Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:10.3119139Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:10.3120209Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:10.3121467Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:10.3122835Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:10.3123925Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.3124867Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.3125627Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:10.3126670Z W0507 20:32:10.305411 87502 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.1449984Z self = 
2025-05-07T20:32:11.1450637Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source identical to the T = 128 example above ...]
2025-05-07T20:32:11.1466664Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:11.1466980Z moe/activation_test.py:126: 
[... triton_quantize_fp8_row / _kernel_quantize_fp8_row traceback identical to the T = 128 example above ...]
2025-05-07T20:32:11.1487395Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.1487762Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.1488043Z E ^
2025-05-07T20:32:11.1488604Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.1489482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.1490108Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.1490535Z self=,
2025-05-07T20:32:11.1490949Z T=16384,
2025-05-07T20:32:11.1491149Z D=5120,
2025-05-07T20:32:11.1491354Z scale_ub=None,
2025-05-07T20:32:11.1491576Z contiguous=True,
2025-05-07T20:32:11.1491804Z compiled=True,
2025-05-07T20:32:11.1492021Z )
2025-05-07T20:32:11.1916439Z W0507 20:32:11.190207 87502 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:11.1917715Z W0507 20:32:11.190207 87502 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:11.1919072Z W0507 20:32:11.190207 87502 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:11.1920058Z W0507 20:32:11.190207 87502 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:11.1921163Z W0507 20:32:11.190207 87502 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
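Note on the repeated triton_kernel_wrap.py warnings above: when torch.compile traces a user-defined Triton kernel, it first lowers the kernel to TTIR (generate_ttir) to discover which pointer arguments are actually written. Here that lowering itself raises on the unsupported fp8e4nv dtype, so identify_mutated_tensors logs the exception and falls back to the conservative answer, "every input is mutated". The warnings are thus a side effect of the real problem, which resurfaces as the hard CompilationError at launch time. A minimal sketch of the pattern being traced, assuming a CUDA machine with triton installed; the names _copy_kernel and copy are hypothetical, not FBGEMM API:

import torch
import triton
import triton.language as tl

@triton.jit
def _copy_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program copies one BLOCK-sized tile; the tl.store below is the
    # write that identify_mutated_tensors tries to recover from the TTIR.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    tl.store(y_ptr + offsets, tl.load(x_ptr + offsets, mask=mask), mask=mask)

def copy(x: torch.Tensor) -> torch.Tensor:
    y = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    _copy_kernel[grid](x, y, x.numel(), BLOCK=1024)
    return y

# torch.compile routes the raw kernel launch through
# torch._higher_order_ops.triton_kernel_wrap, the module that emits the
# "assuming every input is mutated" warnings in this log.
copy_compiled = torch.compile(copy)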
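The convert_frame.py warning above is a separate, more benign issue: each hypothesis example hands silu_mul_quant tensors with fresh shapes or strides (the quoted 'x0' stride mismatch, 5120 vs 10240, is the contiguous=True vs False split), Dynamo installs guards per variant, and after eight recompiles it hits config.recompile_limit and falls back to eager for that frame. A sketch of the usual mitigations, assuming the torch._dynamo knob named in the warning itself (recompile_limit) plus mark_dynamic; exact defaults vary by PyTorch version:

import torch
import torch._dynamo as dynamo

# Raise the per-frame recompile budget (the warning shows the default of 8).
dynamo.config.recompile_limit = 64

x0 = torch.randn(128, 5120, dtype=torch.bfloat16)
# Or mark the token dimension as dynamic so one compiled graph serves
# T in {1, 128, 2048, 4096, 16384} instead of one graph per size.
dynamo.mark_dynamic(x0, 0)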
2025-05-07T20:32:11.3137356Z self = 
2025-05-07T20:32:11.3138061Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source identical to the T = 128 example above ...]
2025-05-07T20:32:11.3153874Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:11.3154188Z moe/activation_test.py:126: 
[... triton_quantize_fp8_row / _kernel_quantize_fp8_row traceback identical to the T = 128 example above ...]
2025-05-07T20:32:11.3174725Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.3175101Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.3175375Z E ^
2025-05-07T20:32:11.3175841Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.3176288Z 2025-05-07T20:32:11.3176717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.3177229Z 2025-05-07T20:32:11.3177335Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.3177760Z self=, 2025-05-07T20:32:11.3178170Z T=1, 2025-05-07T20:32:11.3178357Z D=5120, 2025-05-07T20:32:11.3178560Z scale_ub=1200.0, 2025-05-07T20:32:11.3178797Z contiguous=True, 2025-05-07T20:32:11.3179024Z compiled=True, 2025-05-07T20:32:11.3179241Z ) 2025-05-07T20:32:11.4885287Z self = 2025-05-07T20:32:11.4886945Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.4887487Z 2025-05-07T20:32:11.4887661Z @given( 2025-05-07T20:32:11.4888130Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.4888779Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.4889414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.4890151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.4890802Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.4891179Z ) 2025-05-07T20:32:11.4891581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.4892109Z def test_silu_mul_quant( 2025-05-07T20:32:11.4892392Z self, 2025-05-07T20:32:11.4892601Z T: int, 2025-05-07T20:32:11.4892813Z D: int, 2025-05-07T20:32:11.4893052Z scale_ub: Optional[float], 2025-05-07T20:32:11.4893353Z contiguous: bool, 2025-05-07T20:32:11.4893630Z compiled: bool, 2025-05-07T20:32:11.4893880Z ) -> None: 2025-05-07T20:32:11.4894111Z torch.manual_seed(2025) 2025-05-07T20:32:11.4894380Z 2025-05-07T20:32:11.4894682Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.4895070Z 2025-05-07T20:32:11.4895275Z x_sign = torch.sign(x) 2025-05-07T20:32:11.4895601Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.4895947Z x = x_sign * x_clamp 2025-05-07T20:32:11.4896218Z x0 = x[:, :D] 2025-05-07T20:32:11.4896453Z x1 = x[:, D:] 2025-05-07T20:32:11.4896674Z 2025-05-07T20:32:11.4896874Z if contiguous: 2025-05-07T20:32:11.4897131Z x0 = x0.contiguous() 2025-05-07T20:32:11.4897412Z x1 = x1.contiguous() 2025-05-07T20:32:11.4897681Z 2025-05-07T20:32:11.4897891Z if scale_ub is not None: 2025-05-07T20:32:11.4898192Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.4898703Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.4899030Z ) 2025-05-07T20:32:11.4899233Z else: 2025-05-07T20:32:11.4899447Z scale_ub_tensor = None 2025-05-07T20:32:11.4899712Z 2025-05-07T20:32:11.4899953Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.4900270Z op = silu_mul_quant 2025-05-07T20:32:11.4900535Z if compiled: 2025-05-07T20:32:11.4900797Z op = torch.compile(op) 2025-05-07T20:32:11.4901225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.4901515Z 2025-05-07T20:32:11.4901723Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.4901892Z 2025-05-07T20:32:11.4901999Z moe/activation_test.py:117: 2025-05-07T20:32:11.4902303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.4902652Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.4902944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.4903517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.4904083Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.4904745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.4905430Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.4905972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.4906656Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.4907323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.4907855Z kernel = self.compile( 2025-05-07T20:32:11.4908409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.4909159Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.4909568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.4909807Z 2025-05-07T20:32:11.4910020Z self = 2025-05-07T20:32:11.4911127Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.4912510Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7eb831e50>} 2025-05-07T20:32:11.4913856Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.4914872Z context = 2025-05-07T20:32:11.4915172Z 2025-05-07T20:32:11.4915343Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.4915873Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.4916345Z module_map=module_map) 2025-05-07T20:32:11.4916718Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.4917075Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.4917343Z E ^ 2025-05-07T20:32:11.4917805Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.4918676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.4919421Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.4919842Z self=,
2025-05-07T20:32:11.4920253Z T=1,
2025-05-07T20:32:11.4920445Z D=5120,
2025-05-07T20:32:11.4920647Z scale_ub=None,
2025-05-07T20:32:11.4920868Z contiguous=False,
2025-05-07T20:32:11.4921106Z compiled=True,
2025-05-07T20:32:11.4921324Z )
2025-05-07T20:32:11.5732324Z self = 
2025-05-07T20:32:11.5732898Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
[... test source identical to the T = 128 example above ...]
2025-05-07T20:32:11.5748635Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:11.5748950Z moe/activation_test.py:126: 
[... triton_quantize_fp8_row / _kernel_quantize_fp8_row traceback identical to the T = 128 example above ...]
2025-05-07T20:32:11.5769358Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.5769726Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.5770001Z E ^
2025-05-07T20:32:11.5770463Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.5771353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.5772071Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.5772494Z self=,
2025-05-07T20:32:11.5772901Z T=1,
2025-05-07T20:32:11.5773099Z D=5120,
2025-05-07T20:32:11.5773302Z scale_ub=None,
2025-05-07T20:32:11.5773527Z contiguous=True,
2025-05-07T20:32:11.5773762Z compiled=False,
2025-05-07T20:32:11.5773983Z )
2025-05-07T20:32:11.9644446Z self = 
2025-05-07T20:32:11.9645225Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
[... test source identical to the T = 128 example above, through: ...]
2025-05-07T20:32:11.9656403Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:11.9656741Z op = silu_mul_quant
2025-05-07T20:32:11.9657021Z if compiled:
2025-05-07T20:32:11.9657273Z op 
= torch.compile(op) 2025-05-07T20:32:11.9657582Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.9657867Z 2025-05-07T20:32:11.9658062Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.9658244Z 2025-05-07T20:32:11.9658349Z moe/activation_test.py:117: 2025-05-07T20:32:11.9658659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.9659008Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.9659295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.9659995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.9660696Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.9661335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.9662217Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.9662893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.9663438Z kernel = self.compile( 2025-05-07T20:32:11.9663984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.9664655Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.9665066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.9665302Z 2025-05-07T20:32:11.9665521Z self = 2025-05-07T20:32:11.9666599Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.9668007Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ead64dc0>} 2025-05-07T20:32:11.9669367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.9670395Z context = 2025-05-07T20:32:11.9670688Z 2025-05-07T20:32:11.9670860Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.9671392Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.9671877Z module_map=module_map) 2025-05-07T20:32:11.9672260Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.9672618Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.9673002Z E ^ 2025-05-07T20:32:11.9673474Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.9674345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.9674986Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.9675415Z self=,
2025-05-07T20:32:11.9675837Z T=128,
2025-05-07T20:32:11.9676031Z D=5120,
2025-05-07T20:32:11.9676235Z scale_ub=None,
2025-05-07T20:32:11.9676468Z contiguous=False,
2025-05-07T20:32:11.9676700Z compiled=True,
2025-05-07T20:32:11.9676915Z )
2025-05-07T20:32:11.9677246Z self = 
2025-05-07T20:32:11.9677779Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True
[... test source identical to the T = 128 example above ...]
2025-05-07T20:32:11.9698356Z > y_fp8, y_scale = fn()
2025-05-07T20:32:11.9698639Z moe/activation_test.py:117: 
[... _fbgemm_silu_mul_quant compile traceback via torch/_dynamo/eval_frame.py, identical to the T = 1, scale_ub = 1200.0 example above ...]
2025-05-07T20:32:11.9713746Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.9714111Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.9714386Z E ^
2025-05-07T20:32:11.9714864Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.9715315Z 2025-05-07T20:32:11.9715731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.9716260Z 2025-05-07T20:32:11.9716368Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.9716810Z self=, 2025-05-07T20:32:11.9717221Z T=128, 2025-05-07T20:32:11.9717415Z D=7168, 2025-05-07T20:32:11.9717630Z scale_ub=1200.0, 2025-05-07T20:32:11.9717868Z contiguous=False, 2025-05-07T20:32:11.9718100Z compiled=False, 2025-05-07T20:32:11.9718321Z ) 2025-05-07T20:32:12.1251287Z self = 2025-05-07T20:32:12.1252069Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.1252461Z 2025-05-07T20:32:12.1252586Z @given( 2025-05-07T20:32:12.1252847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.1253184Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.1253511Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.1253859Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.1254203Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.1254876Z ) 2025-05-07T20:32:12.1255250Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.1255711Z def test_silu_mul_quant( 2025-05-07T20:32:12.1255966Z self, 2025-05-07T20:32:12.1256174Z T: int, 2025-05-07T20:32:12.1256377Z D: int, 2025-05-07T20:32:12.1256611Z scale_ub: Optional[float], 2025-05-07T20:32:12.1256898Z contiguous: bool, 2025-05-07T20:32:12.1257147Z compiled: bool, 2025-05-07T20:32:12.1257392Z ) -> None: 2025-05-07T20:32:12.1257622Z torch.manual_seed(2025) 2025-05-07T20:32:12.1257875Z 2025-05-07T20:32:12.1258164Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.1258530Z 2025-05-07T20:32:12.1258731Z x_sign = torch.sign(x) 2025-05-07T20:32:12.1259041Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.1259374Z x = x_sign * x_clamp 2025-05-07T20:32:12.1259646Z x0 = x[:, :D] 2025-05-07T20:32:12.1259881Z x1 = x[:, D:] 2025-05-07T20:32:12.1260103Z 2025-05-07T20:32:12.1260299Z if contiguous: 2025-05-07T20:32:12.1260548Z x0 = x0.contiguous() 2025-05-07T20:32:12.1260825Z x1 = x1.contiguous() 2025-05-07T20:32:12.1261198Z 2025-05-07T20:32:12.1261405Z if scale_ub is not None: 2025-05-07T20:32:12.1261696Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.1262052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.1262372Z ) 2025-05-07T20:32:12.1262577Z else: 2025-05-07T20:32:12.1262793Z scale_ub_tensor = None 2025-05-07T20:32:12.1263056Z 2025-05-07T20:32:12.1263300Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.1263622Z op = silu_mul_quant 2025-05-07T20:32:12.1263885Z if compiled: 2025-05-07T20:32:12.1264146Z op = torch.compile(op) 2025-05-07T20:32:12.1264620Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1264914Z 2025-05-07T20:32:12.1265122Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.1265292Z 2025-05-07T20:32:12.1265408Z moe/activation_test.py:117: 2025-05-07T20:32:12.1265713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1266060Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.1266360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1267063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.1267761Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.1268312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.1269001Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.1269681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.1270242Z kernel = self.compile( 2025-05-07T20:32:12.1270793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.1271452Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.1271864Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1272107Z 2025-05-07T20:32:12.1272321Z self = 2025-05-07T20:32:12.1273407Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.1274920Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea920430>} 2025-05-07T20:32:12.1276284Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.1277327Z context = 2025-05-07T20:32:12.1277620Z 2025-05-07T20:32:12.1277799Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.1278332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.1278802Z module_map=module_map) 2025-05-07T20:32:12.1279181Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.1279546Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.1279807Z E ^ 2025-05-07T20:32:12.1280289Z E ValueError("type fp8e4nv not supported in this architecture. 
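Every failure in this run has the same root cause: the Triton kernel requests the fp8e4nv element type (Triton's name for the float8_e4m3fn format), and the NVIDIA backend rejects it at IR-generation time because the GPU in this job only exposes the 'fp8e4b15' and 'fp8e5' formats. fp8e4nv generally requires compute capability 8.9 (Ada) or newer, depending on the Triton version; an Ampere-class part such as the A10G reports 8.6 and is rejected. A minimal probe, sketched here in plain PyTorch (the helper name is ours, not an FBGEMM or Triton API, and the (8, 9) threshold is an assumption about the Triton backend):

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's NVIDIA backend accepts fp8e4nv from
        # compute capability (8, 9) upward; older GPUs raise the
        # ValueError seen in the tracebacks here.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

On a GPU that reports (8, 6), supports_fp8e4nv() returns False, which is exactly the condition under which every example below fails.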
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
    [source listing and traceback identical to the one above; same CompilationError in _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    [identical failure]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    [identical failure; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching activation.py:80]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    [identical failure via the compiled path]
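Every drawn combination of (T, D, scale_ub, contiguous, compiled) dies at Triton compile time, before any data-dependent work runs, so the failure is a property of the hardware rather than of the inputs Hypothesis generates. The usual remedy is to skip the test on GPUs without fp8e4nv support instead of letting each example error out. A sketch, reusing the probe above (the decorator placement and class name are assumptions for illustration, not the test's current code):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Same assumed capability threshold as the probe sketched earlier.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipIf(not supports_fp8e4nv(), "GPU lacks Triton fp8e4nv support")
        def test_silu_mul_quant(self) -> None:
            ...  # the @given-decorated body from the listing above goes here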
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

self = 
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [same @given/@settings decorators and test body as the listing above, continuing past fn():]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
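This is the one example in the batch that fails later: the silu_mul_quant call itself returned, and the error instead comes from the reference path, because triton_quantize_fp8_row also compiles a Triton kernel (_kernel_quantize_fp8_row) that targets fp8e4nv. In other words, on this GPU even the "reference" side cannot run. A device-neutral reference in plain PyTorch would sidestep that; a rough sketch follows (our own helper, not the FBGEMM API; assumes a PyTorch build with torch.float8_e4m3fn, i.e. 2.1 or newer, and FBGEMM's exact clamping and epsilon details may differ):

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Row-wise dynamic quantization: one scale per row, chosen so that
        # x is approximately x_fp8.to(torch.float32) * scale[:, None],
        # matching how the test dequantizes y_fp8 above.
        row_max = x.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return x_fp8, scale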
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    [identical failure in _fbgemm_silu_mul_quant via the compiled path]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.0451166Z 2025-05-07T20:32:13.0451594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.0452165Z 2025-05-07T20:32:13.0452271Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.0452696Z self=, 2025-05-07T20:32:13.0453097Z T=1, 2025-05-07T20:32:13.0453291Z D=5120, 2025-05-07T20:32:13.0453497Z scale_ub=1200.0, 2025-05-07T20:32:13.0453726Z contiguous=False, 2025-05-07T20:32:13.0453965Z compiled=False, 2025-05-07T20:32:13.0454250Z ) 2025-05-07T20:32:13.0454574Z self = 2025-05-07T20:32:13.0455078Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:13.0455357Z 2025-05-07T20:32:13.0455439Z @given( 2025-05-07T20:32:13.0455678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.0456000Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.0456319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.0456660Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.0456997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.0457295Z ) 2025-05-07T20:32:13.0457653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.0458097Z def test_silu_mul_quant( 2025-05-07T20:32:13.0458349Z self, 2025-05-07T20:32:13.0458550Z T: int, 2025-05-07T20:32:13.0458749Z D: int, 2025-05-07T20:32:13.0458984Z scale_ub: Optional[float], 2025-05-07T20:32:13.0459265Z contiguous: bool, 2025-05-07T20:32:13.0459508Z compiled: bool, 2025-05-07T20:32:13.0459740Z ) -> None: 2025-05-07T20:32:13.0459964Z torch.manual_seed(2025) 2025-05-07T20:32:13.0460213Z 2025-05-07T20:32:13.0460489Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.0460841Z 2025-05-07T20:32:13.0461043Z x_sign = torch.sign(x) 2025-05-07T20:32:13.0461401Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.0461722Z x = x_sign * x_clamp 2025-05-07T20:32:13.0461972Z x0 = x[:, :D] 2025-05-07T20:32:13.0462191Z x1 = x[:, D:] 2025-05-07T20:32:13.0462405Z 2025-05-07T20:32:13.0462596Z if contiguous: 2025-05-07T20:32:13.0462828Z x0 = x0.contiguous() 2025-05-07T20:32:13.0463094Z x1 = x1.contiguous() 2025-05-07T20:32:13.0463343Z 2025-05-07T20:32:13.0463627Z if scale_ub is not None: 2025-05-07T20:32:13.0463912Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.0464462Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.0464776Z ) 2025-05-07T20:32:13.0464979Z else: 2025-05-07T20:32:13.0465199Z scale_ub_tensor = None 2025-05-07T20:32:13.0465461Z 2025-05-07T20:32:13.0465695Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.0466019Z op = silu_mul_quant 2025-05-07T20:32:13.0466282Z if compiled: 2025-05-07T20:32:13.0466531Z op = torch.compile(op) 2025-05-07T20:32:13.0466841Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0467127Z 2025-05-07T20:32:13.0467322Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.0467499Z 2025-05-07T20:32:13.0467601Z moe/activation_test.py:117: 2025-05-07T20:32:13.0467908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0468251Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.0468542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0469236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.0469937Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.0470550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.0471242Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.0471915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.0472611Z kernel = self.compile( 2025-05-07T20:32:13.0473159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.0473826Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.0474286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0474518Z 2025-05-07T20:32:13.0474729Z self = 2025-05-07T20:32:13.0475816Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.0477190Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9e373a0>} 2025-05-07T20:32:13.0478536Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.0479562Z context = 2025-05-07T20:32:13.0479856Z 2025-05-07T20:32:13.0480027Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.0480560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.0481035Z module_map=module_map) 2025-05-07T20:32:13.0481411Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.0481791Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.0482060Z E ^ 2025-05-07T20:32:13.0482523Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.0482980Z 2025-05-07T20:32:13.0483404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.0492117Z 2025-05-07T20:32:13.0492256Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.0492815Z self=, 2025-05-07T20:32:13.0493228Z T=16384, 2025-05-07T20:32:13.0493444Z D=5120, 2025-05-07T20:32:13.0493655Z scale_ub=1200.0, 2025-05-07T20:32:13.0493889Z contiguous=False, 2025-05-07T20:32:13.0494129Z compiled=True, 2025-05-07T20:32:13.0494354Z ) 2025-05-07T20:32:13.1664619Z self = 2025-05-07T20:32:13.1665150Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:13.1665445Z 2025-05-07T20:32:13.1665529Z @given( 2025-05-07T20:32:13.1665778Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.1666109Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.1666428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.1666776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.1667126Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.1667456Z ) 2025-05-07T20:32:13.1667820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.1668278Z def test_silu_mul_quant( 2025-05-07T20:32:13.1668523Z self, 2025-05-07T20:32:13.1668732Z T: int, 2025-05-07T20:32:13.1668945Z D: int, 2025-05-07T20:32:13.1669443Z scale_ub: Optional[float], 2025-05-07T20:32:13.1669737Z contiguous: bool, 2025-05-07T20:32:13.1669992Z compiled: bool, 2025-05-07T20:32:13.1670226Z ) -> None: 2025-05-07T20:32:13.1670455Z torch.manual_seed(2025) 2025-05-07T20:32:13.1670714Z 2025-05-07T20:32:13.1670991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.1671351Z 2025-05-07T20:32:13.1671554Z x_sign = torch.sign(x) 2025-05-07T20:32:13.1671853Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.1672165Z x = x_sign * x_clamp 2025-05-07T20:32:13.1672520Z x0 = x[:, :D] 2025-05-07T20:32:13.1672749Z x1 = x[:, D:] 2025-05-07T20:32:13.1672960Z 2025-05-07T20:32:13.1673161Z if contiguous: 2025-05-07T20:32:13.1673406Z x0 = x0.contiguous() 2025-05-07T20:32:13.1673673Z x1 = x1.contiguous() 2025-05-07T20:32:13.1673929Z 2025-05-07T20:32:13.1674139Z if scale_ub is not None: 2025-05-07T20:32:13.1674418Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.1674771Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.1675097Z ) 2025-05-07T20:32:13.1675296Z else: 2025-05-07T20:32:13.1675521Z scale_ub_tensor = None 2025-05-07T20:32:13.1675790Z 2025-05-07T20:32:13.1676028Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.1676361Z op = silu_mul_quant 2025-05-07T20:32:13.1676631Z if compiled: 2025-05-07T20:32:13.1676895Z op = torch.compile(op) 2025-05-07T20:32:13.1677208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.1677503Z 2025-05-07T20:32:13.1677711Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.1677885Z 2025-05-07T20:32:13.1678024Z moe/activation_test.py:117: 2025-05-07T20:32:13.1678327Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.1678676Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.1678975Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.1679544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.1680123Z return fn(*args, **kwargs) 
2025-05-07T20:32:13.1680798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.1681491Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.1682030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.1682887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.1683564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.1684100Z kernel = self.compile( 2025-05-07T20:32:13.1684665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.1685342Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.1685750Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.1685984Z 2025-05-07T20:32:13.1686197Z self = 2025-05-07T20:32:13.1687285Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.1688672Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea45b0d0>} 2025-05-07T20:32:13.1690042Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.1691130Z context = 2025-05-07T20:32:13.1691425Z 2025-05-07T20:32:13.1691597Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.1692132Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.1692604Z module_map=module_map) 2025-05-07T20:32:13.1692975Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.1693392Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.1693664Z E ^ 2025-05-07T20:32:13.1694133Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.1694592Z 2025-05-07T20:32:13.1695010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.1695534Z 2025-05-07T20:32:13.1695639Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.1696059Z self=, 2025-05-07T20:32:13.1696470Z T=2048, 2025-05-07T20:32:13.1696664Z D=7168, 2025-05-07T20:32:13.1696868Z scale_ub=1200.0, 2025-05-07T20:32:13.1697100Z contiguous=False, 2025-05-07T20:32:13.1697330Z compiled=True, 2025-05-07T20:32:13.1697551Z ) 2025-05-07T20:32:13.1697880Z self = 2025-05-07T20:32:13.1698391Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:13.1698676Z 2025-05-07T20:32:13.1698759Z @given( 2025-05-07T20:32:13.1699003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.1699322Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.1699649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.1699990Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.1700330Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.1700625Z ) 2025-05-07T20:32:13.1700986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.1701548Z def test_silu_mul_quant( 2025-05-07T20:32:13.1701795Z self, 2025-05-07T20:32:13.1702006Z T: int, 2025-05-07T20:32:13.1702216Z D: int, 2025-05-07T20:32:13.1702440Z scale_ub: Optional[float], 2025-05-07T20:32:13.1702728Z contiguous: bool, 2025-05-07T20:32:13.1703074Z compiled: bool, 2025-05-07T20:32:13.1703305Z ) -> None: 2025-05-07T20:32:13.1703533Z torch.manual_seed(2025) 2025-05-07T20:32:13.1703790Z 2025-05-07T20:32:13.1704073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.1704425Z 2025-05-07T20:32:13.1704620Z x_sign = torch.sign(x) 2025-05-07T20:32:13.1704924Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.1705249Z x = x_sign * x_clamp 2025-05-07T20:32:13.1705504Z x0 = x[:, :D] 2025-05-07T20:32:13.1705721Z x1 = x[:, D:] 2025-05-07T20:32:13.1705936Z 2025-05-07T20:32:13.1706129Z if contiguous: 2025-05-07T20:32:13.1706359Z x0 = x0.contiguous() 2025-05-07T20:32:13.1706626Z x1 = x1.contiguous() 2025-05-07T20:32:13.1706876Z 2025-05-07T20:32:13.1707066Z if scale_ub is not None: 2025-05-07T20:32:13.1707349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.1707699Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.1708008Z ) 2025-05-07T20:32:13.1708205Z else: 2025-05-07T20:32:13.1708424Z scale_ub_tensor = None 2025-05-07T20:32:13.1708677Z 2025-05-07T20:32:13.1708916Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.1709291Z op = silu_mul_quant 2025-05-07T20:32:13.1709544Z if compiled: 2025-05-07T20:32:13.1709802Z op = torch.compile(op) 2025-05-07T20:32:13.1710107Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.1710389Z 2025-05-07T20:32:13.1710582Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.1710759Z 2025-05-07T20:32:13.1710863Z moe/activation_test.py:117: 2025-05-07T20:32:13.1711170Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.1711551Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.1711841Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.1712463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.1713020Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function>, 'min_dot_size': <function>}
module_map = {'triton.language.extra.libdevice': <module>}
context = <context>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<test instance>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <test instance>
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
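All of the failures in this excerpt share one root cause: while lowering the _fbgemm_silu_mul_quant Triton kernel, the compiler rejects the fp8e4nv (FP8 E4M3) element type because this runner's GPU architecture only offers the 'fp8e4b15' and 'fp8e5' variants. The sketch below shows one way a test suite could skip these cases up front instead of failing per-example; it is a hedged sketch, not FBGEMM's actual guard. The helper names are hypothetical, and the compute-capability threshold of 8.9 (Ada/Hopper) is an assumption about Triton's fp8e4nv support, not something stated in this log.

    # Hedged sketch (assumed threshold, hypothetical helper names): skip FP8 E4M3
    # tests on GPUs that predate hardware FP8 support, where Triton raises the
    # fp8e4nv ValueError seen above.
    import unittest

    import torch


    def gpu_supports_fp8e4nv() -> bool:
        # Assumption: Triton accepts fp8e4nv from compute capability 8.9 up;
        # earlier CUDA architectures only get 'fp8e4b15' and 'fp8e5'.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)


    # Decorator a unittest-style test like test_silu_mul_quant could apply:
    requires_fp8e4nv = unittest.skipUnless(
        gpu_supports_fp8e4nv(), "fp8e4nv not supported on this GPU architecture"
    )

With such a guard the whole parameter sweep below would report as skipped once, rather than compiling and failing the same kernel for every sampled example.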
Hypothesis then retried the test with further sampled parameter combinations. Each one failed at the same point, with the identical CompilationError raised from triton/compiler/compiler.py:100; the per-example source listing and traceback are verbatim repeats of the one above:

Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError

For the compiled=True examples the traceback additionally passes through torch/_dynamo/eval_frame.py:678 (in _fn: return fn(*args, **kwargs)) before reaching activation.py:80; the error itself is unchanged.
2025-05-07T20:32:14.5716585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.5717278Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.5717826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.5718518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.5719182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.5719724Z kernel = self.compile( 2025-05-07T20:32:14.5720361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.5721029Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.5721446Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.5721691Z 2025-05-07T20:32:14.5721903Z self = 2025-05-07T20:32:14.5722989Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.5724373Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9b48040>} 2025-05-07T20:32:14.5725745Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.5726788Z context = 2025-05-07T20:32:14.5727082Z 2025-05-07T20:32:14.5727264Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.5727840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.5728314Z module_map=module_map) 2025-05-07T20:32:14.5728691Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.5729053Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.5729316Z E ^ 2025-05-07T20:32:14.5729784Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.5730241Z 2025-05-07T20:32:14.5730677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.5731243Z 2025-05-07T20:32:14.9769645Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.9770838Z self=, 2025-05-07T20:32:14.9771689Z T=16384, 2025-05-07T20:32:14.9772006Z D=5120, 2025-05-07T20:32:14.9772205Z scale_ub=1200.0, 2025-05-07T20:32:14.9772446Z contiguous=False, 2025-05-07T20:32:14.9772685Z compiled=False, 2025-05-07T20:32:14.9772899Z ) 2025-05-07T20:32:14.9773239Z self = 2025-05-07T20:32:14.9773756Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.9774041Z 2025-05-07T20:32:14.9774126Z @given( 2025-05-07T20:32:14.9774375Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.9774701Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.9775046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.9775392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.9775739Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.9776037Z ) 2025-05-07T20:32:14.9776395Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.9776856Z def test_silu_mul_quant( 2025-05-07T20:32:14.9777111Z self, 2025-05-07T20:32:14.9777315Z T: int, 2025-05-07T20:32:14.9777525Z D: int, 2025-05-07T20:32:14.9777758Z scale_ub: Optional[float], 2025-05-07T20:32:14.9778035Z contiguous: bool, 2025-05-07T20:32:14.9778288Z compiled: bool, 2025-05-07T20:32:14.9778527Z ) -> None: 2025-05-07T20:32:14.9778750Z torch.manual_seed(2025) 2025-05-07T20:32:14.9779005Z 2025-05-07T20:32:14.9779288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.9779644Z 2025-05-07T20:32:14.9780175Z x_sign = torch.sign(x) 2025-05-07T20:32:14.9780490Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.9780825Z x = x_sign * x_clamp 2025-05-07T20:32:14.9781162Z x0 = x[:, :D] 2025-05-07T20:32:14.9781391Z x1 = x[:, D:] 2025-05-07T20:32:14.9781602Z 2025-05-07T20:32:14.9781800Z if contiguous: 2025-05-07T20:32:14.9782043Z x0 = x0.contiguous() 2025-05-07T20:32:14.9782309Z x1 = x1.contiguous() 2025-05-07T20:32:14.9782564Z 2025-05-07T20:32:14.9782767Z if scale_ub is not None: 2025-05-07T20:32:14.9783044Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.9783390Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.9783711Z ) 2025-05-07T20:32:14.9783907Z else: 2025-05-07T20:32:14.9784130Z scale_ub_tensor = None 2025-05-07T20:32:14.9784392Z 2025-05-07T20:32:14.9784628Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.9784961Z op = silu_mul_quant 2025-05-07T20:32:14.9785223Z if compiled: 2025-05-07T20:32:14.9785475Z op = torch.compile(op) 2025-05-07T20:32:14.9785784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.9786068Z 2025-05-07T20:32:14.9786269Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.9786518Z 2025-05-07T20:32:14.9786623Z moe/activation_test.py:117: 2025-05-07T20:32:14.9786933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.9787275Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.9787561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.9788268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:14.9788964Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.9789521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.9790295Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.9790970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.9791516Z kernel = self.compile( 2025-05-07T20:32:14.9792063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.9792726Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.9793135Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.9793369Z 2025-05-07T20:32:14.9793585Z self = 2025-05-07T20:32:14.9794667Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.9796079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9b488b0>} 2025-05-07T20:32:14.9797417Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.9798440Z context = 2025-05-07T20:32:14.9798732Z 2025-05-07T20:32:14.9798911Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.9799437Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.9799907Z module_map=module_map) 2025-05-07T20:32:14.9800284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.9800721Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.9800999Z E ^ 2025-05-07T20:32:14.9801466Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.9801917Z 2025-05-07T20:32:14.9802351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.9802905Z 2025-05-07T20:32:14.9803012Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.9803436Z self=, 2025-05-07T20:32:14.9803846Z T=16384, 2025-05-07T20:32:14.9804041Z D=5120, 2025-05-07T20:32:14.9804242Z scale_ub=1200.0, 2025-05-07T20:32:14.9804474Z contiguous=True, 2025-05-07T20:32:14.9804696Z compiled=True, 2025-05-07T20:32:14.9804907Z ) 2025-05-07T20:32:14.9805235Z self = 2025-05-07T20:32:14.9805743Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.9806032Z 2025-05-07T20:32:14.9806111Z @given( 2025-05-07T20:32:14.9806353Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.9806674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.9807030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.9807369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.9807706Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.9807993Z ) 2025-05-07T20:32:14.9808351Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.9808804Z def test_silu_mul_quant( 2025-05-07T20:32:14.9809048Z self, 2025-05-07T20:32:14.9809250Z T: int, 2025-05-07T20:32:14.9809455Z D: int, 2025-05-07T20:32:14.9809682Z scale_ub: Optional[float], 2025-05-07T20:32:14.9809955Z contiguous: bool, 2025-05-07T20:32:14.9810285Z compiled: bool, 2025-05-07T20:32:14.9810518Z ) -> None: 2025-05-07T20:32:14.9810737Z torch.manual_seed(2025) 2025-05-07T20:32:14.9810988Z 2025-05-07T20:32:14.9811267Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.9811616Z 2025-05-07T20:32:14.9811821Z x_sign = torch.sign(x) 2025-05-07T20:32:14.9812127Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.9812444Z x = x_sign * x_clamp 2025-05-07T20:32:14.9812694Z x0 = x[:, :D] 2025-05-07T20:32:14.9812924Z x1 = x[:, D:] 2025-05-07T20:32:14.9813133Z 2025-05-07T20:32:14.9813330Z if contiguous: 2025-05-07T20:32:14.9813573Z x0 = x0.contiguous() 2025-05-07T20:32:14.9813835Z x1 = x1.contiguous() 2025-05-07T20:32:14.9814089Z 2025-05-07T20:32:14.9814288Z if scale_ub is not None: 2025-05-07T20:32:14.9814565Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.9814922Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.9815238Z ) 2025-05-07T20:32:14.9815437Z else: 2025-05-07T20:32:14.9815649Z scale_ub_tensor = None 2025-05-07T20:32:14.9815905Z 2025-05-07T20:32:14.9816145Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.9816464Z op = silu_mul_quant 2025-05-07T20:32:14.9816723Z if compiled: 2025-05-07T20:32:14.9816979Z op = torch.compile(op) 2025-05-07T20:32:14.9817278Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.9817562Z 2025-05-07T20:32:14.9817763Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.9817928Z 2025-05-07T20:32:14.9818031Z moe/activation_test.py:117: 2025-05-07T20:32:14.9818335Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.9818676Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.9818968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.9819615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.9820196Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.9820853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.9821626Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.9822168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.9822853Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.9823523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.9824053Z kernel = self.compile( 2025-05-07T20:32:14.9824599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.9825269Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.9825678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.9825919Z 2025-05-07T20:32:14.9826129Z self = 2025-05-07T20:32:14.9827261Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.9828640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9a215e0>} 2025-05-07T20:32:14.9829988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.9831040Z context = 2025-05-07T20:32:14.9831339Z 2025-05-07T20:32:14.9831509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.9832045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.9832524Z module_map=module_map) 2025-05-07T20:32:14.9832893Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.9833258Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.9833528Z E ^ 2025-05-07T20:32:14.9833995Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.9834453Z 2025-05-07T20:32:14.9834867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.9835386Z 2025-05-07T20:32:15.2077554Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.2078358Z self=, 2025-05-07T20:32:15.2079035Z T=16384, 2025-05-07T20:32:15.2079353Z D=5120, 2025-05-07T20:32:15.2079657Z scale_ub=None, 2025-05-07T20:32:15.2080014Z contiguous=False, 2025-05-07T20:32:15.2080399Z compiled=True, 2025-05-07T20:32:15.2080733Z ) 2025-05-07T20:32:15.2081259Z self = 2025-05-07T20:32:15.2082107Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:15.2082508Z 2025-05-07T20:32:15.2082616Z @given( 2025-05-07T20:32:15.2082940Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.2083377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.2083817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.2084288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.2085180Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.2085618Z ) 2025-05-07T20:32:15.2086163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.2086874Z def test_silu_mul_quant( 2025-05-07T20:32:15.2087227Z self, 2025-05-07T20:32:15.2087521Z T: int, 2025-05-07T20:32:15.2087829Z D: int, 2025-05-07T20:32:15.2088150Z scale_ub: Optional[float], 2025-05-07T20:32:15.2088565Z contiguous: bool, 2025-05-07T20:32:15.2088941Z compiled: bool, 2025-05-07T20:32:15.2089298Z ) -> None: 2025-05-07T20:32:15.2089632Z torch.manual_seed(2025) 2025-05-07T20:32:15.2090008Z 2025-05-07T20:32:15.2090415Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.2090966Z 2025-05-07T20:32:15.2091286Z x_sign = torch.sign(x) 2025-05-07T20:32:15.2091747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.2092246Z x = x_sign * x_clamp 2025-05-07T20:32:15.2092633Z x0 = x[:, :D] 2025-05-07T20:32:15.2092984Z x1 = x[:, D:] 2025-05-07T20:32:15.2093317Z 2025-05-07T20:32:15.2093619Z if contiguous: 2025-05-07T20:32:15.2093997Z x0 = x0.contiguous() 2025-05-07T20:32:15.2094414Z x1 = x1.contiguous() 2025-05-07T20:32:15.2094969Z 2025-05-07T20:32:15.2095296Z if scale_ub is not None: 2025-05-07T20:32:15.2095754Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.2096322Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.2096847Z ) 2025-05-07T20:32:15.2097157Z else: 2025-05-07T20:32:15.2097499Z scale_ub_tensor = None 2025-05-07T20:32:15.2097916Z 2025-05-07T20:32:15.2098292Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.2098828Z op = silu_mul_quant 2025-05-07T20:32:15.2099249Z if compiled: 2025-05-07T20:32:15.2099870Z op = torch.compile(op) 2025-05-07T20:32:15.2100367Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2100842Z 2025-05-07T20:32:15.2101299Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.2101588Z 2025-05-07T20:32:15.2101749Z moe/activation_test.py:117: 2025-05-07T20:32:15.2102246Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2102812Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.2103280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2104247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.2105220Z return fn(*args, **kwargs) 
2025-05-07T20:32:15.2106362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.2107555Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.2108499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.2109711Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.2110859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.2111801Z kernel = self.compile( 2025-05-07T20:32:15.2112804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.2113947Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.2114621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2115017Z 2025-05-07T20:32:15.2115349Z self = 2025-05-07T20:32:15.2117318Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.2119684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9c405e0>} 2025-05-07T20:32:15.2121994Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.2123783Z context = 2025-05-07T20:32:15.2124284Z 2025-05-07T20:32:15.2124557Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.2125447Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.2126250Z module_map=module_map) 2025-05-07T20:32:15.2126873Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.2127448Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.2127880Z E ^ 2025-05-07T20:32:15.2128668Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.2129569Z 2025-05-07T20:32:15.2130302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.2131203Z 2025-05-07T20:32:15.2131381Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.2132077Z self=, 2025-05-07T20:32:15.2132770Z T=2048, 2025-05-07T20:32:15.2133074Z D=5120, 2025-05-07T20:32:15.2133389Z scale_ub=None, 2025-05-07T20:32:15.2133728Z contiguous=False, 2025-05-07T20:32:15.2134090Z compiled=True, 2025-05-07T20:32:15.2134423Z ) 2025-05-07T20:32:15.3350086Z self = 2025-05-07T20:32:15.3351289Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:15.3351755Z 2025-05-07T20:32:15.3351892Z @given( 2025-05-07T20:32:15.3352268Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.3352853Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.3353334Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.3353833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.3354320Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.3354787Z ) 2025-05-07T20:32:15.3355387Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.3356154Z def test_silu_mul_quant( 2025-05-07T20:32:15.3356559Z self, 2025-05-07T20:32:15.3356876Z T: int, 2025-05-07T20:32:15.3357193Z D: int, 2025-05-07T20:32:15.3357550Z scale_ub: Optional[float], 2025-05-07T20:32:15.3358017Z contiguous: bool, 2025-05-07T20:32:15.3358408Z compiled: bool, 2025-05-07T20:32:15.3358779Z ) -> None: 2025-05-07T20:32:15.3359158Z torch.manual_seed(2025) 2025-05-07T20:32:15.3359556Z 2025-05-07T20:32:15.3360009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.3360598Z 2025-05-07T20:32:15.3360917Z x_sign = torch.sign(x) 2025-05-07T20:32:15.3361389Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.3361908Z x = x_sign * x_clamp 2025-05-07T20:32:15.3362356Z x0 = x[:, :D] 2025-05-07T20:32:15.3362700Z x1 = x[:, D:] 2025-05-07T20:32:15.3363043Z 2025-05-07T20:32:15.3363349Z if contiguous: 2025-05-07T20:32:15.3363722Z x0 = x0.contiguous() 2025-05-07T20:32:15.3364153Z x1 = x1.contiguous() 2025-05-07T20:32:15.3364555Z 2025-05-07T20:32:15.3364863Z if scale_ub is not None: 2025-05-07T20:32:15.3365540Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.3366115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.3366629Z ) 2025-05-07T20:32:15.3378895Z else: 2025-05-07T20:32:15.3379273Z scale_ub_tensor = None 2025-05-07T20:32:15.3379695Z 2025-05-07T20:32:15.3380083Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.3380603Z op = silu_mul_quant 2025-05-07T20:32:15.3381025Z if compiled: 2025-05-07T20:32:15.3381557Z op = torch.compile(op) 2025-05-07T20:32:15.3382030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.3382492Z 2025-05-07T20:32:15.3382803Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.3383077Z 2025-05-07T20:32:15.3383245Z moe/activation_test.py:117: 2025-05-07T20:32:15.3383726Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.3384280Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.3384773Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.3385738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.3386724Z return fn(*args, **kwargs) 
2025-05-07T20:32:15.3387885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.3389259Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.3390204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.3391404Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.3392582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.3393515Z kernel = self.compile( 2025-05-07T20:32:15.3394466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.3395686Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.3396362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.3396770Z 2025-05-07T20:32:15.3397117Z self = 2025-05-07T20:32:15.3399043Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.3401544Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9a21c10>} 2025-05-07T20:32:15.3404038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.3405862Z context = 2025-05-07T20:32:15.3406375Z 2025-05-07T20:32:15.3406653Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.3407563Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.3408373Z module_map=module_map) 2025-05-07T20:32:15.3408980Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.3409577Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.3410016Z E ^ 2025-05-07T20:32:15.3410813Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.3411632Z 2025-05-07T20:32:15.3412368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.3413462Z 2025-05-07T20:32:15.3413638Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.3414348Z self=, 2025-05-07T20:32:15.3415035Z T=2048, 2025-05-07T20:32:15.3415346Z D=5120, 2025-05-07T20:32:15.3415671Z scale_ub=1200.0, 2025-05-07T20:32:15.3416033Z contiguous=False, 2025-05-07T20:32:15.3416409Z compiled=True, 2025-05-07T20:32:15.3416750Z ) 2025-05-07T20:32:15.3417277Z self = 2025-05-07T20:32:15.3418136Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:15.3418611Z 2025-05-07T20:32:15.3418748Z @given( 2025-05-07T20:32:15.3419125Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.3419653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.3420175Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.3420731Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.3421360Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.3421741Z ) 2025-05-07T20:32:15.3422243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.3422835Z def test_silu_mul_quant( 2025-05-07T20:32:15.3423268Z self, 2025-05-07T20:32:15.3423547Z T: int, 2025-05-07T20:32:15.3423815Z D: int, 2025-05-07T20:32:15.3424125Z scale_ub: Optional[float], 2025-05-07T20:32:15.3424507Z contiguous: bool, 2025-05-07T20:32:15.3424837Z compiled: bool, 2025-05-07T20:32:15.3425159Z ) -> None: 2025-05-07T20:32:15.3425455Z torch.manual_seed(2025) 2025-05-07T20:32:15.3425785Z 2025-05-07T20:32:15.3426152Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.3426661Z 2025-05-07T20:32:15.3426955Z x_sign = torch.sign(x) 2025-05-07T20:32:15.3427385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.3427951Z x = x_sign * x_clamp 2025-05-07T20:32:15.3428310Z x0 = x[:, :D] 2025-05-07T20:32:15.3428636Z x1 = x[:, D:] 2025-05-07T20:32:15.3428960Z 2025-05-07T20:32:15.3429240Z if contiguous: 2025-05-07T20:32:15.3429573Z x0 = x0.contiguous() 2025-05-07T20:32:15.3429973Z x1 = x1.contiguous() 2025-05-07T20:32:15.3430333Z 2025-05-07T20:32:15.3430615Z if scale_ub is not None: 2025-05-07T20:32:15.3431025Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.3431510Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.3431947Z ) 2025-05-07T20:32:15.3432234Z else: 2025-05-07T20:32:15.3432542Z scale_ub_tensor = None 2025-05-07T20:32:15.3432920Z 2025-05-07T20:32:15.3433269Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.3433769Z op = silu_mul_quant 2025-05-07T20:32:15.3434149Z if compiled: 2025-05-07T20:32:15.3434521Z op = torch.compile(op) 2025-05-07T20:32:15.3434993Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.3435431Z 2025-05-07T20:32:15.3435722Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.3435993Z 2025-05-07T20:32:15.3436145Z moe/activation_test.py:117: 2025-05-07T20:32:15.3436620Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.3437142Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.3437588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.3438490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.3439392Z return fn(*args, **kwargs) 
2025-05-07T20:32:15.3440684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.3441822Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.3442915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.3444019Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.3445098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.3445961Z kernel = self.compile( 2025-05-07T20:32:15.3446833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.3447840Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.3448481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.3448854Z 2025-05-07T20:32:15.3449190Z self = 2025-05-07T20:32:15.3450995Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.3453431Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e98e8820>} 2025-05-07T20:32:15.3455854Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.3457575Z context = 2025-05-07T20:32:15.3458074Z 2025-05-07T20:32:15.3458348Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.3459256Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.3460079Z module_map=module_map) 2025-05-07T20:32:15.3460811Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.3461491Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.3461934Z E ^ 2025-05-07T20:32:15.3462741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.3463553Z 2025-05-07T20:32:15.3464290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.3465221Z 2025-05-07T20:32:15.5699072Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.5699817Z self=, 2025-05-07T20:32:15.5700482Z T=4096, 2025-05-07T20:32:15.5700781Z D=5120, 2025-05-07T20:32:15.5701265Z scale_ub=1200.0, 2025-05-07T20:32:15.5701632Z contiguous=True, 2025-05-07T20:32:15.5701992Z compiled=True, 2025-05-07T20:32:15.5702343Z ) 2025-05-07T20:32:15.5702865Z self = 2025-05-07T20:32:15.5703690Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:15.5704071Z 2025-05-07T20:32:15.5704182Z @given( 2025-05-07T20:32:15.5704502Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.5704959Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.5705401Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.5705894Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.5706392Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.5706827Z ) 2025-05-07T20:32:15.5707376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.5708077Z def test_silu_mul_quant( 2025-05-07T20:32:15.5708424Z self, 2025-05-07T20:32:15.5708716Z T: int, 2025-05-07T20:32:15.5709037Z D: int, 2025-05-07T20:32:15.5710306Z scale_ub: Optional[float], 2025-05-07T20:32:15.5710754Z contiguous: bool, 2025-05-07T20:32:15.5711135Z compiled: bool, 2025-05-07T20:32:15.5711498Z ) -> None: 2025-05-07T20:32:15.5711867Z torch.manual_seed(2025) 2025-05-07T20:32:15.5712326Z 2025-05-07T20:32:15.5712754Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.5713315Z 2025-05-07T20:32:15.5713618Z x_sign = torch.sign(x) 2025-05-07T20:32:15.5714097Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.5714610Z x = x_sign * x_clamp 2025-05-07T20:32:15.5714996Z x0 = x[:, :D] 2025-05-07T20:32:15.5715338Z x1 = x[:, D:] 2025-05-07T20:32:15.5715674Z 2025-05-07T20:32:15.5715975Z if contiguous: 2025-05-07T20:32:15.5716362Z x0 = x0.contiguous() 2025-05-07T20:32:15.5716795Z x1 = x1.contiguous() 2025-05-07T20:32:15.5717192Z 2025-05-07T20:32:15.5717515Z if scale_ub is not None: 2025-05-07T20:32:15.5717977Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.5718524Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.5719044Z ) 2025-05-07T20:32:15.5719360Z else: 2025-05-07T20:32:15.5719705Z scale_ub_tensor = None 2025-05-07T20:32:15.5720267Z 2025-05-07T20:32:15.5720649Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.5721195Z op = silu_mul_quant 2025-05-07T20:32:15.5721605Z if compiled: 2025-05-07T20:32:15.5722021Z op = torch.compile(op) 2025-05-07T20:32:15.5722517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.5722974Z 2025-05-07T20:32:15.5723289Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.5723569Z 2025-05-07T20:32:15.5723740Z moe/activation_test.py:117: 2025-05-07T20:32:15.5724235Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.5724935Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.5725408Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.5726375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.5727333Z return fn(*args, **kwargs) 
2025-05-07T20:32:15.5728483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.5729698Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.5730630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.5731820Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.5733042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.5733967Z kernel = self.compile( 2025-05-07T20:32:15.5734911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.5736036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.5736698Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.5737085Z 2025-05-07T20:32:15.5737417Z self = 2025-05-07T20:32:15.5739215Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.5741912Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9ddd430>} 2025-05-07T20:32:15.5744496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.5746299Z context = 2025-05-07T20:32:15.5746797Z 2025-05-07T20:32:15.5747081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.5747977Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.5748764Z module_map=module_map) 2025-05-07T20:32:15.5749368Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.5749950Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.5750381Z E ^ 2025-05-07T20:32:15.5751180Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.5751978Z 2025-05-07T20:32:15.5752717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.5753637Z 2025-05-07T20:32:15.5753808Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.5754513Z self=, 2025-05-07T20:32:15.5755179Z T=128, 2025-05-07T20:32:15.5755591Z D=5120, 2025-05-07T20:32:15.5755903Z scale_ub=1200.0, 2025-05-07T20:32:15.5756266Z contiguous=False, 2025-05-07T20:32:15.5756629Z compiled=True, 2025-05-07T20:32:15.5756968Z ) 2025-05-07T20:32:15.9157143Z self = 2025-05-07T20:32:15.9158052Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:15.9158515Z 2025-05-07T20:32:15.9158654Z @given( 2025-05-07T20:32:15.9159028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.9159559Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.9160066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.9160887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.9161408Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.9161888Z ) 2025-05-07T20:32:15.9162472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.9163245Z def test_silu_mul_quant( 2025-05-07T20:32:15.9163644Z self, 2025-05-07T20:32:15.9163950Z T: int, 2025-05-07T20:32:15.9164276Z D: int, 2025-05-07T20:32:15.9164632Z scale_ub: Optional[float], 2025-05-07T20:32:15.9165084Z contiguous: bool, 2025-05-07T20:32:15.9165470Z compiled: bool, 2025-05-07T20:32:15.9165841Z ) -> None: 2025-05-07T20:32:15.9166190Z torch.manual_seed(2025) 2025-05-07T20:32:15.9166585Z 2025-05-07T20:32:15.9167031Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.9167619Z 2025-05-07T20:32:15.9167928Z x_sign = torch.sign(x) 2025-05-07T20:32:15.9168422Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.9168947Z x = x_sign * x_clamp 2025-05-07T20:32:15.9169334Z x0 = x[:, :D] 2025-05-07T20:32:15.9169691Z x1 = x[:, D:] 2025-05-07T20:32:15.9170030Z 2025-05-07T20:32:15.9170325Z if contiguous: 2025-05-07T20:32:15.9170715Z x0 = x0.contiguous() 2025-05-07T20:32:15.9171145Z x1 = x1.contiguous() 2025-05-07T20:32:15.9171537Z 2025-05-07T20:32:15.9171852Z if scale_ub is not None: 2025-05-07T20:32:15.9172304Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.9172906Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.9173425Z ) 2025-05-07T20:32:15.9173740Z else: 2025-05-07T20:32:15.9174082Z scale_ub_tensor = None 2025-05-07T20:32:15.9174501Z 2025-05-07T20:32:15.9174876Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.9175633Z op = silu_mul_quant 2025-05-07T20:32:15.9176051Z if compiled: 2025-05-07T20:32:15.9176458Z op = torch.compile(op) 2025-05-07T20:32:15.9176951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.9177405Z 2025-05-07T20:32:15.9177720Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.9178001Z 2025-05-07T20:32:15.9178171Z moe/activation_test.py:117: 2025-05-07T20:32:15.9178657Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.9179221Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.9179689Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.9180647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.9181739Z return fn(*args, **kwargs) 
2025-05-07T20:32:15.9182889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.9184112Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.9184990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.9186151Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.9187392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.9188300Z kernel = self.compile( 2025-05-07T20:32:15.9189196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.9190300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.9190963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.9191356Z 2025-05-07T20:32:15.9191708Z self = 2025-05-07T20:32:15.9193631Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.9196216Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e982b040>} 2025-05-07T20:32:15.9198639Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.9200461Z context = 2025-05-07T20:32:15.9200966Z 2025-05-07T20:32:15.9201251Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.9202147Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.9202979Z module_map=module_map) 2025-05-07T20:32:15.9203595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.9204181Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.9204619Z E ^ 2025-05-07T20:32:15.9205404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.9206212Z 2025-05-07T20:32:15.9206954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.9207863Z 2025-05-07T20:32:15.9208031Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.9208736Z self=, 2025-05-07T20:32:15.9209429Z T=16384, 2025-05-07T20:32:15.9209740Z D=7168, 2025-05-07T20:32:15.9210058Z scale_ub=1200.0, 2025-05-07T20:32:15.9210427Z contiguous=True, 2025-05-07T20:32:15.9210908Z compiled=True, 2025-05-07T20:32:15.9211255Z ) 2025-05-07T20:32:15.9211790Z self = 2025-05-07T20:32:15.9212647Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:15.9213124Z 2025-05-07T20:32:15.9213248Z @given( 2025-05-07T20:32:15.9213627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.9214155Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.9214661Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.9215225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.9215783Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.9216259Z ) 2025-05-07T20:32:15.9216857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.9217619Z def test_silu_mul_quant( 2025-05-07T20:32:15.9218011Z self, 2025-05-07T20:32:15.9218330Z T: int, 2025-05-07T20:32:15.9218667Z D: int, 2025-05-07T20:32:15.9219030Z scale_ub: Optional[float], 2025-05-07T20:32:15.9219477Z contiguous: bool, 2025-05-07T20:32:15.9219878Z compiled: bool, 2025-05-07T20:32:15.9220245Z ) -> None: 2025-05-07T20:32:15.9220588Z torch.manual_seed(2025) 2025-05-07T20:32:15.9221149Z 2025-05-07T20:32:15.9221603Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.9222174Z 2025-05-07T20:32:15.9222490Z x_sign = torch.sign(x) 2025-05-07T20:32:15.9222974Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.9223489Z x = x_sign * x_clamp 2025-05-07T20:32:15.9223892Z x0 = x[:, :D] 2025-05-07T20:32:15.9224249Z x1 = x[:, D:] 2025-05-07T20:32:15.9224584Z 2025-05-07T20:32:15.9224891Z if contiguous: 2025-05-07T20:32:15.9225278Z x0 = x0.contiguous() 2025-05-07T20:32:15.9225703Z x1 = x1.contiguous() 2025-05-07T20:32:15.9226186Z 2025-05-07T20:32:15.9226506Z if scale_ub is not None: 2025-05-07T20:32:15.9226963Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.9227522Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.9228009Z ) 2025-05-07T20:32:15.9228308Z else: 2025-05-07T20:32:15.9228586Z scale_ub_tensor = None 2025-05-07T20:32:15.9228930Z 2025-05-07T20:32:15.9229255Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.9229666Z op = silu_mul_quant 2025-05-07T20:32:15.9230016Z if compiled: 2025-05-07T20:32:15.9230358Z op = torch.compile(op) 2025-05-07T20:32:15.9230764Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.9231148Z 2025-05-07T20:32:15.9231413Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.9231647Z 2025-05-07T20:32:15.9231785Z moe/activation_test.py:117: 2025-05-07T20:32:15.9232193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.9232648Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.9233066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.9233879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.9234686Z return fn(*args, **kwargs) 
2025-05-07T20:32:15.9235661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.9236660Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.9237459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.9238472Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.9239439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.9240679Z kernel = self.compile( 2025-05-07T20:32:15.9241534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.9242554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.9243180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.9243586Z 2025-05-07T20:32:15.9243910Z self = 2025-05-07T20:32:15.9245681Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.9247961Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e982bb80>} 2025-05-07T20:32:15.9250173Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.9251847Z context = 2025-05-07T20:32:15.9252424Z 2025-05-07T20:32:15.9252685Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.9253528Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.9254272Z module_map=module_map) 2025-05-07T20:32:15.9254845Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.9255400Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.9255802Z E ^ 2025-05-07T20:32:15.9256552Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.9257296Z 2025-05-07T20:32:15.9258091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.9258925Z 2025-05-07T20:32:16.2018284Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.2036191Z self=, 2025-05-07T20:32:16.2036939Z T=16384, 2025-05-07T20:32:16.2037262Z D=5120, 2025-05-07T20:32:16.2037571Z scale_ub=1200.0, 2025-05-07T20:32:16.2037946Z contiguous=True, 2025-05-07T20:32:16.2038318Z compiled=False, 2025-05-07T20:32:16.2038646Z ) 2025-05-07T20:32:16.2039181Z self = 2025-05-07T20:32:16.2040037Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.2040832Z 2025-05-07T20:32:16.2040973Z @given( 2025-05-07T20:32:16.2041352Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.2041889Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.2042430Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.2042996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.2043561Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.2044051Z ) 2025-05-07T20:32:16.2044652Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.2045428Z def test_silu_mul_quant( 2025-05-07T20:32:16.2045844Z self, 2025-05-07T20:32:16.2046154Z T: int, 2025-05-07T20:32:16.2046479Z D: int, 2025-05-07T20:32:16.2046840Z scale_ub: Optional[float], 2025-05-07T20:32:16.2047291Z contiguous: bool, 2025-05-07T20:32:16.2047696Z compiled: bool, 2025-05-07T20:32:16.2048072Z ) -> None: 2025-05-07T20:32:16.2048427Z torch.manual_seed(2025) 2025-05-07T20:32:16.2048829Z 2025-05-07T20:32:16.2049284Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.2049878Z 2025-05-07T20:32:16.2050598Z x_sign = torch.sign(x) 2025-05-07T20:32:16.2051108Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.2051645Z x = x_sign * x_clamp 2025-05-07T20:32:16.2052045Z x0 = x[:, :D] 2025-05-07T20:32:16.2052404Z x1 = x[:, D:] 2025-05-07T20:32:16.2052764Z 2025-05-07T20:32:16.2053063Z if contiguous: 2025-05-07T20:32:16.2053456Z x0 = x0.contiguous() 2025-05-07T20:32:16.2053898Z x1 = x1.contiguous() 2025-05-07T20:32:16.2054290Z 2025-05-07T20:32:16.2054599Z if scale_ub is not None: 2025-05-07T20:32:16.2055066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.2055631Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.2056161Z ) 2025-05-07T20:32:16.2056480Z else: 2025-05-07T20:32:16.2056804Z scale_ub_tensor = None 2025-05-07T20:32:16.2057227Z 2025-05-07T20:32:16.2057615Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.2058135Z op = silu_mul_quant 2025-05-07T20:32:16.2058525Z if compiled: 2025-05-07T20:32:16.2058925Z op = torch.compile(op) 2025-05-07T20:32:16.2059378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.2059826Z 2025-05-07T20:32:16.2060256Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.2060509Z 2025-05-07T20:32:16.2060675Z moe/activation_test.py:117: 2025-05-07T20:32:16.2061262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.2061811Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.2062278Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.2063434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.2064614Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.2065549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.2066877Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.2068034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.2068966Z kernel = self.compile( 2025-05-07T20:32:16.2069880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.2071019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.2071686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.2072099Z 2025-05-07T20:32:16.2072489Z self = 2025-05-07T20:32:16.2074403Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.2076853Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e97835e0>} 2025-05-07T20:32:16.2079248Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.2081058Z context = 2025-05-07T20:32:16.2081570Z 2025-05-07T20:32:16.2081846Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.2082721Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.2083518Z module_map=module_map) 2025-05-07T20:32:16.2084263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.2084868Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.2085310Z E ^ 2025-05-07T20:32:16.2086103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.2086909Z 2025-05-07T20:32:16.2087640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.2088536Z 2025-05-07T20:32:16.2088714Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.2089402Z self=, 2025-05-07T20:32:16.2090090Z T=1, 2025-05-07T20:32:16.2090398Z D=7168, 2025-05-07T20:32:16.2090717Z scale_ub=1200.0, 2025-05-07T20:32:16.2091077Z contiguous=False, 2025-05-07T20:32:16.2091453Z compiled=False, 2025-05-07T20:32:16.2091794Z ) 2025-05-07T20:32:16.2092336Z self = 2025-05-07T20:32:16.2093165Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.2093608Z 2025-05-07T20:32:16.2093746Z @given( 2025-05-07T20:32:16.2094113Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.2094639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.2095239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.2095792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.2096339Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.2096815Z ) 2025-05-07T20:32:16.2097402Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.2098150Z def test_silu_mul_quant( 2025-05-07T20:32:16.2098559Z self, 2025-05-07T20:32:16.2098879Z T: int, 2025-05-07T20:32:16.2099213Z D: int, 2025-05-07T20:32:16.2099574Z scale_ub: Optional[float], 2025-05-07T20:32:16.2100123Z contiguous: bool, 2025-05-07T20:32:16.2100514Z compiled: bool, 2025-05-07T20:32:16.2100887Z ) -> None: 2025-05-07T20:32:16.2101314Z torch.manual_seed(2025) 2025-05-07T20:32:16.2101692Z 2025-05-07T20:32:16.2102131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.2102711Z 2025-05-07T20:32:16.2103022Z x_sign = torch.sign(x) 2025-05-07T20:32:16.2103521Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.2104040Z x = x_sign * x_clamp 2025-05-07T20:32:16.2104448Z x0 = x[:, :D] 2025-05-07T20:32:16.2104808Z x1 = x[:, D:] 2025-05-07T20:32:16.2105150Z 2025-05-07T20:32:16.2105443Z if contiguous: 2025-05-07T20:32:16.2105826Z x0 = x0.contiguous() 2025-05-07T20:32:16.2106262Z x1 = x1.contiguous() 2025-05-07T20:32:16.2106656Z 2025-05-07T20:32:16.2106968Z if scale_ub is not None: 2025-05-07T20:32:16.2107433Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.2107989Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.2108519Z ) 2025-05-07T20:32:16.2108836Z else: 2025-05-07T20:32:16.2109171Z scale_ub_tensor = None 2025-05-07T20:32:16.2109594Z 2025-05-07T20:32:16.2109975Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.2110501Z op = silu_mul_quant 2025-05-07T20:32:16.2110921Z if compiled: 2025-05-07T20:32:16.2111328Z op = torch.compile(op) 2025-05-07T20:32:16.2111808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.2112258Z 2025-05-07T20:32:16.2112564Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.2112834Z 2025-05-07T20:32:16.2113006Z moe/activation_test.py:117: 2025-05-07T20:32:16.2113470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.2114014Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.2114635Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.2115817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.2117068Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.2117910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.2118939Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.2120038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.2120971Z kernel = self.compile( 2025-05-07T20:32:16.2121910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.2123041Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.2123731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.2124138Z 2025-05-07T20:32:16.2124485Z self = 2025-05-07T20:32:16.2126400Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.2128939Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e97839d0>} 2025-05-07T20:32:16.2131309Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.2133171Z context = 2025-05-07T20:32:16.2133664Z 2025-05-07T20:32:16.2134027Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.2134931Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.2135729Z module_map=module_map) 2025-05-07T20:32:16.2136343Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.2136936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.2137362Z E ^ 2025-05-07T20:32:16.2138168Z E ValueError("type fp8e4nv not supported in this architecture. 
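Every CompilationError in this run has the same root cause: the kernel asks Triton for the fp8e4nv element type (Triton's name for torch.float8_e4m3fn), and Triton only compiles that type on GPUs of compute capability 8.9 or newer (Ada and Hopper). The linux.g5.4xlarge runner carries an NVIDIA A10G at compute capability 8.6, which is why the message below offers only fp8e4b15 and fp8e5. A minimal guard, sketched with an illustrative helper name that is not part of FBGEMM, would skip the fp8 tests on such parts:

    import unittest

    import torch

    def _cuda_supports_fp8e4nv() -> bool:
        # fp8e4nv needs compute capability >= (8, 9); the A10G on this
        # runner reports (8, 6), so the suite would be skipped there.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_cuda_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class ActivationFP8Test(unittest.TestCase):
        ...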
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.2138971Z 2025-05-07T20:32:16.2139711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.2140916Z 2025-05-07T20:32:16.2141149Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.2141848Z self=, 2025-05-07T20:32:16.2142511Z T=4096, 2025-05-07T20:32:16.2142815Z D=7168, 2025-05-07T20:32:16.2143120Z scale_ub=1200.0, 2025-05-07T20:32:16.2143482Z contiguous=False, 2025-05-07T20:32:16.2143847Z compiled=True, 2025-05-07T20:32:16.2144170Z ) 2025-05-07T20:32:16.3312826Z self = 2025-05-07T20:32:16.3313774Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.3314259Z 2025-05-07T20:32:16.3314383Z @given( 2025-05-07T20:32:16.3314767Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3315299Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3315793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3316320Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3316807Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3317253Z ) 2025-05-07T20:32:16.3318203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3318992Z def test_silu_mul_quant( 2025-05-07T20:32:16.3319382Z self, 2025-05-07T20:32:16.3319726Z T: int, 2025-05-07T20:32:16.3320044Z D: int, 2025-05-07T20:32:16.3320402Z scale_ub: Optional[float], 2025-05-07T20:32:16.3320857Z contiguous: bool, 2025-05-07T20:32:16.3321249Z compiled: bool, 2025-05-07T20:32:16.3321619Z ) -> None: 2025-05-07T20:32:16.3321966Z torch.manual_seed(2025) 2025-05-07T20:32:16.3322358Z 2025-05-07T20:32:16.3322805Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3323385Z 2025-05-07T20:32:16.3323695Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3324162Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3324680Z x = x_sign * x_clamp 2025-05-07T20:32:16.3325079Z x0 = x[:, :D] 2025-05-07T20:32:16.3325429Z x1 = x[:, D:] 2025-05-07T20:32:16.3325775Z 2025-05-07T20:32:16.3326089Z if contiguous: 2025-05-07T20:32:16.3326468Z x0 = x0.contiguous() 2025-05-07T20:32:16.3326899Z x1 = x1.contiguous() 2025-05-07T20:32:16.3327299Z 2025-05-07T20:32:16.3327607Z if scale_ub is not None: 2025-05-07T20:32:16.3328062Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3328749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3329259Z ) 2025-05-07T20:32:16.3329576Z else: 2025-05-07T20:32:16.3329918Z scale_ub_tensor = None 2025-05-07T20:32:16.3330330Z 2025-05-07T20:32:16.3330711Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3331238Z op = silu_mul_quant 2025-05-07T20:32:16.3331657Z if compiled: 2025-05-07T20:32:16.3332065Z op = torch.compile(op) 2025-05-07T20:32:16.3332603Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3333071Z 2025-05-07T20:32:16.3333517Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3333804Z 2025-05-07T20:32:16.3333965Z moe/activation_test.py:117: 2025-05-07T20:32:16.3334463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3335014Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3335497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3336465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3337434Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3338579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3339785Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3340985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3342261Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3343397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3344308Z kernel = self.compile( 2025-05-07T20:32:16.3345212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3346318Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3346979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3347374Z 2025-05-07T20:32:16.3347730Z self = 2025-05-07T20:32:16.3349658Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3352342Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9696c10>} 2025-05-07T20:32:16.3354807Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3356631Z context = 2025-05-07T20:32:16.3357130Z 2025-05-07T20:32:16.3357413Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3358298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3359116Z module_map=module_map) 2025-05-07T20:32:16.3359728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3360318Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3360748Z E ^ 2025-05-07T20:32:16.3361550Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3362360Z 2025-05-07T20:32:16.3363101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3364118Z 2025-05-07T20:32:16.3364288Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3364993Z self=, 2025-05-07T20:32:16.3365690Z T=128, 2025-05-07T20:32:16.3365999Z D=7168, 2025-05-07T20:32:16.3366310Z scale_ub=1200.0, 2025-05-07T20:32:16.3366682Z contiguous=False, 2025-05-07T20:32:16.3367054Z compiled=True, 2025-05-07T20:32:16.3367388Z ) 2025-05-07T20:32:16.3367930Z self = 2025-05-07T20:32:16.3368775Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.3369351Z 2025-05-07T20:32:16.3369482Z @given( 2025-05-07T20:32:16.3369852Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3370383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3370892Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3371456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3372026Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3372513Z ) 2025-05-07T20:32:16.3373101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3373866Z def test_silu_mul_quant( 2025-05-07T20:32:16.3374273Z self, 2025-05-07T20:32:16.3374584Z T: int, 2025-05-07T20:32:16.3374917Z D: int, 2025-05-07T20:32:16.3375271Z scale_ub: Optional[float], 2025-05-07T20:32:16.3375714Z contiguous: bool, 2025-05-07T20:32:16.3376111Z compiled: bool, 2025-05-07T20:32:16.3376479Z ) -> None: 2025-05-07T20:32:16.3376833Z torch.manual_seed(2025) 2025-05-07T20:32:16.3377236Z 2025-05-07T20:32:16.3377685Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3378260Z 2025-05-07T20:32:16.3378573Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3379056Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3379584Z x = x_sign * x_clamp 2025-05-07T20:32:16.3379976Z x0 = x[:, :D] 2025-05-07T20:32:16.3380333Z x1 = x[:, D:] 2025-05-07T20:32:16.3380680Z 2025-05-07T20:32:16.3380978Z if contiguous: 2025-05-07T20:32:16.3381466Z x0 = x0.contiguous() 2025-05-07T20:32:16.3381896Z x1 = x1.contiguous() 2025-05-07T20:32:16.3382296Z 2025-05-07T20:32:16.3382611Z if scale_ub is not None: 2025-05-07T20:32:16.3383122Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3383658Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3384151Z ) 2025-05-07T20:32:16.3384534Z else: 2025-05-07T20:32:16.3384819Z scale_ub_tensor = None 2025-05-07T20:32:16.3385165Z 2025-05-07T20:32:16.3385492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3385920Z op = silu_mul_quant 2025-05-07T20:32:16.3386268Z if compiled: 2025-05-07T20:32:16.3386604Z op = torch.compile(op) 2025-05-07T20:32:16.3387000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3387385Z 2025-05-07T20:32:16.3387655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3387885Z 2025-05-07T20:32:16.3388030Z moe/activation_test.py:117: 2025-05-07T20:32:16.3388443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3388903Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3389326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3390163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3390984Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3391969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3392987Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3393870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3394937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3396013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3396873Z kernel = self.compile( 2025-05-07T20:32:16.3397760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3398893Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3399715Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3400107Z 2025-05-07T20:32:16.3400460Z self = 2025-05-07T20:32:16.3402295Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3404691Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e98a9820>} 2025-05-07T20:32:16.3407088Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3408910Z context = 2025-05-07T20:32:16.3409431Z 2025-05-07T20:32:16.3409708Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3410616Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3411431Z module_map=module_map) 2025-05-07T20:32:16.3412037Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3412636Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3413077Z E ^ 2025-05-07T20:32:16.3413886Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3414696Z 2025-05-07T20:32:16.3415428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3416347Z 2025-05-07T20:32:16.5092428Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5093281Z self=, 2025-05-07T20:32:16.5093724Z T=2048, 2025-05-07T20:32:16.5093954Z D=7168, 2025-05-07T20:32:16.5094149Z scale_ub=None, 2025-05-07T20:32:16.5094373Z contiguous=True, 2025-05-07T20:32:16.5094607Z compiled=True, 2025-05-07T20:32:16.5094822Z ) 2025-05-07T20:32:16.5095156Z self = 2025-05-07T20:32:16.5095671Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5095943Z 2025-05-07T20:32:16.5096032Z @given( 2025-05-07T20:32:16.5096264Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5096589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5096907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5097239Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5097582Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5097873Z ) 2025-05-07T20:32:16.5098235Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5098686Z def test_silu_mul_quant( 2025-05-07T20:32:16.5098938Z self, 2025-05-07T20:32:16.5099133Z T: int, 2025-05-07T20:32:16.5099338Z D: int, 2025-05-07T20:32:16.5099567Z scale_ub: Optional[float], 2025-05-07T20:32:16.5099921Z contiguous: bool, 2025-05-07T20:32:16.5100162Z compiled: bool, 2025-05-07T20:32:16.5100399Z ) -> None: 2025-05-07T20:32:16.5100622Z torch.manual_seed(2025) 2025-05-07T20:32:16.5100866Z 2025-05-07T20:32:16.5101242Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5101595Z 2025-05-07T20:32:16.5101789Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5102088Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5102406Z x = x_sign * x_clamp 2025-05-07T20:32:16.5102651Z x0 = x[:, :D] 2025-05-07T20:32:16.5102968Z x1 = x[:, D:] 2025-05-07T20:32:16.5103183Z 2025-05-07T20:32:16.5103369Z if contiguous: 2025-05-07T20:32:16.5103610Z x0 = x0.contiguous() 2025-05-07T20:32:16.5103878Z x1 = x1.contiguous() 2025-05-07T20:32:16.5104122Z 2025-05-07T20:32:16.5104319Z if scale_ub is not None: 2025-05-07T20:32:16.5104604Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5104952Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5113510Z ) 2025-05-07T20:32:16.5113746Z else: 2025-05-07T20:32:16.5113969Z scale_ub_tensor = None 2025-05-07T20:32:16.5114235Z 2025-05-07T20:32:16.5114476Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5114810Z op = silu_mul_quant 2025-05-07T20:32:16.5115077Z if compiled: 2025-05-07T20:32:16.5115328Z op = torch.compile(op) 2025-05-07T20:32:16.5115640Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5115939Z 2025-05-07T20:32:16.5116135Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5116311Z 2025-05-07T20:32:16.5116417Z moe/activation_test.py:117: 2025-05-07T20:32:16.5116728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5117078Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5117372Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5117950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.5118524Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.5119183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.5119887Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5120434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5121243Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5121923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5122467Z kernel = self.compile( 2025-05-07T20:32:16.5123018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5123682Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5124082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5124323Z 2025-05-07T20:32:16.5124535Z self = 2025-05-07T20:32:16.5125804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5127206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e97b54c0>} 2025-05-07T20:32:16.5128559Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5129645Z context = 2025-05-07T20:32:16.5129943Z 2025-05-07T20:32:16.5130118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5130650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5131114Z module_map=module_map) 2025-05-07T20:32:16.5131493Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5131854Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5132162Z E ^ 2025-05-07T20:32:16.5132639Z E ValueError("type fp8e4nv not supported in this architecture. 
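For orientation, the op under test takes the two D-wide halves of a bf16 activation and returns an fp8 tensor plus a scale. A rough eager-mode sketch of those semantics, assuming a SiLU-gated product with rowwise fp8 quantization and an optional upper bound on the scale (the real FBGEMM Triton kernel is not shown in this log), looks like:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # silu(x0) * x1 in fp32 for accuracy, then rowwise quantization to fp8.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / torch.finfo(torch.float8_e4m3fn).max
        return (y / scale).to(torch.float8_e4m3fn), scale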
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5133104Z 2025-05-07T20:32:16.5133520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5134043Z 2025-05-07T20:32:16.5134154Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5134565Z self=, 2025-05-07T20:32:16.5134975Z T=16384, 2025-05-07T20:32:16.5135174Z D=5120, 2025-05-07T20:32:16.5135366Z scale_ub=None, 2025-05-07T20:32:16.5135594Z contiguous=False, 2025-05-07T20:32:16.5135829Z compiled=False, 2025-05-07T20:32:16.5136046Z ) 2025-05-07T20:32:16.5136365Z self = 2025-05-07T20:32:16.5136875Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.5137163Z 2025-05-07T20:32:16.5137252Z @given( 2025-05-07T20:32:16.5137482Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5137808Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5138127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5138467Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5138805Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5139099Z ) 2025-05-07T20:32:16.5139456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5139907Z def test_silu_mul_quant( 2025-05-07T20:32:16.5140535Z self, 2025-05-07T20:32:16.5140801Z T: int, 2025-05-07T20:32:16.5141129Z D: int, 2025-05-07T20:32:16.5141424Z scale_ub: Optional[float], 2025-05-07T20:32:16.5141787Z contiguous: bool, 2025-05-07T20:32:16.5142052Z compiled: bool, 2025-05-07T20:32:16.5142452Z ) -> None: 2025-05-07T20:32:16.5142680Z torch.manual_seed(2025) 2025-05-07T20:32:16.5142924Z 2025-05-07T20:32:16.5143207Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5143557Z 2025-05-07T20:32:16.5143750Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5144054Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5146091Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5147954Z 2025-05-07T20:32:16.5148078Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.5148302Z 2025-05-07T20:32:16.5148408Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5148841Z self=, 2025-05-07T20:32:16.5149241Z T=4096, 2025-05-07T20:32:16.5149502Z D=7168, 2025-05-07T20:32:16.5149704Z scale_ub=1200.0, 2025-05-07T20:32:16.5149929Z contiguous=True, 2025-05-07T20:32:16.5150157Z compiled=True, 2025-05-07T20:32:16.5150367Z ) 2025-05-07T20:32:16.5150684Z self = 2025-05-07T20:32:16.5151183Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.5151455Z 2025-05-07T20:32:16.5151544Z @given( 2025-05-07T20:32:16.5151773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5152097Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5152490Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5152829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5153162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5153457Z ) 2025-05-07T20:32:16.5153813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5154258Z def test_silu_mul_quant( 2025-05-07T20:32:16.5154510Z self, 2025-05-07T20:32:16.5154714Z T: int, 2025-05-07T20:32:16.5154913Z D: int, 2025-05-07T20:32:16.5155143Z scale_ub: Optional[float], 2025-05-07T20:32:16.5155426Z contiguous: bool, 2025-05-07T20:32:16.5155667Z compiled: bool, 2025-05-07T20:32:16.5155898Z ) -> None: 2025-05-07T20:32:16.5156124Z torch.manual_seed(2025) 2025-05-07T20:32:16.5156369Z 2025-05-07T20:32:16.5156647Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5156999Z 2025-05-07T20:32:16.5157209Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5157500Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5159486Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5161372Z 2025-05-07T20:32:16.5161495Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.5161711Z 2025-05-07T20:32:16.5161823Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5162240Z self=, 2025-05-07T20:32:16.5162774Z T=16384, 2025-05-07T20:32:16.5162977Z D=7168, 2025-05-07T20:32:16.5163173Z scale_ub=None, 2025-05-07T20:32:16.5163398Z contiguous=False, 2025-05-07T20:32:16.5163630Z compiled=False, 2025-05-07T20:32:16.5163839Z ) 2025-05-07T20:32:16.6211437Z self = 2025-05-07T20:32:16.6212192Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.6212482Z 2025-05-07T20:32:16.6212574Z @given( 2025-05-07T20:32:16.6212816Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6213143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6213465Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6213804Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6214145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6214449Z ) 2025-05-07T20:32:16.6214818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6215286Z def test_silu_mul_quant( 2025-05-07T20:32:16.6215543Z self, 2025-05-07T20:32:16.6215752Z T: int, 2025-05-07T20:32:16.6215955Z D: int, 2025-05-07T20:32:16.6216186Z scale_ub: Optional[float], 2025-05-07T20:32:16.6216748Z contiguous: bool, 2025-05-07T20:32:16.6216996Z compiled: bool, 2025-05-07T20:32:16.6217236Z ) -> None: 2025-05-07T20:32:16.6217465Z torch.manual_seed(2025) 2025-05-07T20:32:16.6217712Z 2025-05-07T20:32:16.6217997Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6220056Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.6222188Z 2025-05-07T20:32:16.6222322Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.6222543Z 2025-05-07T20:32:16.6222656Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6223077Z self=, 2025-05-07T20:32:16.6223487Z T=2048, 2025-05-07T20:32:16.6223688Z D=7168, 2025-05-07T20:32:16.6223884Z scale_ub=1200.0, 2025-05-07T20:32:16.6224119Z contiguous=True, 2025-05-07T20:32:16.6224352Z compiled=True, 2025-05-07T20:32:16.6224562Z ) 2025-05-07T20:32:16.6224893Z self = 2025-05-07T20:32:16.6225395Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.6225682Z 2025-05-07T20:32:16.6225764Z @given( 2025-05-07T20:32:16.6226006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6226328Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6226645Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6226982Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6227324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6227620Z ) 2025-05-07T20:32:16.6227974Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6228425Z def test_silu_mul_quant( 2025-05-07T20:32:16.6228681Z self, 2025-05-07T20:32:16.6228881Z T: int, 2025-05-07T20:32:16.6229091Z D: int, 2025-05-07T20:32:16.6229323Z scale_ub: Optional[float], 2025-05-07T20:32:16.6229603Z contiguous: bool, 2025-05-07T20:32:16.6229855Z compiled: bool, 2025-05-07T20:32:16.6230090Z ) -> None: 2025-05-07T20:32:16.6230453Z torch.manual_seed(2025) 2025-05-07T20:32:16.6230715Z 2025-05-07T20:32:16.6230997Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6231348Z 2025-05-07T20:32:16.6231547Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6231851Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6233828Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
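The OutOfMemoryError request sizes match the test's own shapes exactly: x is [T, 2*D] in bfloat16 at 2 bytes per element, and each elementwise step (abs, clamp, sign, the product) materializes one more tensor of that size. Checking the four examples above:

    # one [T, 2*D] bfloat16 tensor, in MiB
    for T, D in [(16384, 5120), (4096, 7168), (16384, 7168), (2048, 7168)]:
        print(T, D, T * 2 * D * 2 / 2**20)
    # 16384 5120 320.0   -> the 320.00 MiB request
    # 4096  7168 112.0   -> the 112.00 MiB request
    # 16384 7168 448.0   -> the 448.00 MiB request
    # 2048  7168 56.0    -> the 56.00 MiB request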
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.6235648Z 2025-05-07T20:32:16.6235793Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.6236009Z 2025-05-07T20:32:16.6236115Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6236539Z self=, 2025-05-07T20:32:16.6236945Z T=2048, 2025-05-07T20:32:16.6237208Z D=7168, 2025-05-07T20:32:16.6237404Z scale_ub=None, 2025-05-07T20:32:16.6237627Z contiguous=True, 2025-05-07T20:32:16.6237861Z compiled=False, 2025-05-07T20:32:16.6238070Z ) 2025-05-07T20:32:16.6238395Z self = 2025-05-07T20:32:16.6238894Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.6239166Z 2025-05-07T20:32:16.6239248Z @given( 2025-05-07T20:32:16.6239503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6239829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6240486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6240935Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6241283Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6241578Z ) 2025-05-07T20:32:16.6241936Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6242390Z def test_silu_mul_quant( 2025-05-07T20:32:16.6242634Z self, 2025-05-07T20:32:16.6242840Z T: int, 2025-05-07T20:32:16.6243048Z D: int, 2025-05-07T20:32:16.6243270Z scale_ub: Optional[float], 2025-05-07T20:32:16.6243554Z contiguous: bool, 2025-05-07T20:32:16.6243805Z compiled: bool, 2025-05-07T20:32:16.6244033Z ) -> None: 2025-05-07T20:32:16.6244262Z torch.manual_seed(2025) 2025-05-07T20:32:16.6244516Z 2025-05-07T20:32:16.6244795Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6245145Z 2025-05-07T20:32:16.6245346Z > x_sign = torch.sign(x) 2025-05-07T20:32:16.6247260Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
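Note how little headroom is left by this point: a 22.07 GiB device with under 30 MiB free, so even a 56 MiB request fails. Each failing Hypothesis example leaves its tensors to be reclaimed lazily while the next example allocates immediately. Two standard mitigations, sketched on the assumption that the suite can run setup code before CUDA initializes plus a hook between examples:

    import gc
    import os

    # Must be set before the first CUDA allocation (e.g. in conftest.py),
    # otherwise it has no effect; this is the setting the error text suggests.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling references, then return cached blocks to the driver
        # so the next example starts against an empty caching allocator.
        gc.collect()
        torch.cuda.empty_cache()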
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.6249119Z 2025-05-07T20:32:16.6249245Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.6249461Z 2025-05-07T20:32:16.6249568Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6249988Z self=, 2025-05-07T20:32:16.6250397Z T=1, 2025-05-07T20:32:16.6250590Z D=7168, 2025-05-07T20:32:16.6250784Z scale_ub=1200.0, 2025-05-07T20:32:16.6251141Z contiguous=True, 2025-05-07T20:32:16.6251375Z compiled=False, 2025-05-07T20:32:16.6251582Z ) 2025-05-07T20:32:16.7821031Z self = 2025-05-07T20:32:16.7821943Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.7822335Z 2025-05-07T20:32:16.7822461Z @given( 2025-05-07T20:32:16.7822740Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.7823062Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.7823389Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.7823740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.7824082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.7824386Z ) 2025-05-07T20:32:16.7824755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.7825205Z def test_silu_mul_quant( 2025-05-07T20:32:16.7825480Z self, 2025-05-07T20:32:16.7825690Z T: int, 2025-05-07T20:32:16.7825891Z D: int, 2025-05-07T20:32:16.7826125Z scale_ub: Optional[float], 2025-05-07T20:32:16.7826415Z contiguous: bool, 2025-05-07T20:32:16.7826666Z compiled: bool, 2025-05-07T20:32:16.7827092Z ) -> None: 2025-05-07T20:32:16.7827324Z torch.manual_seed(2025) 2025-05-07T20:32:16.7827573Z 2025-05-07T20:32:16.7827863Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.7828222Z 2025-05-07T20:32:16.7828429Z x_sign = torch.sign(x) 2025-05-07T20:32:16.7828733Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.7829064Z x = x_sign * x_clamp 2025-05-07T20:32:16.7829321Z x0 = x[:, :D] 2025-05-07T20:32:16.7829548Z x1 = x[:, D:] 2025-05-07T20:32:16.7829775Z 2025-05-07T20:32:16.7829975Z if contiguous: 2025-05-07T20:32:16.7830222Z x0 = x0.contiguous() 2025-05-07T20:32:16.7830604Z x1 = x1.contiguous() 2025-05-07T20:32:16.7830858Z 2025-05-07T20:32:16.7831061Z if scale_ub is not None: 2025-05-07T20:32:16.7831352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.7831711Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.7832036Z ) 2025-05-07T20:32:16.7832246Z else: 2025-05-07T20:32:16.7832472Z scale_ub_tensor = None 2025-05-07T20:32:16.7832733Z 2025-05-07T20:32:16.7832978Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.7833308Z op = silu_mul_quant 2025-05-07T20:32:16.7833574Z if compiled: 2025-05-07T20:32:16.7833830Z op = torch.compile(op) 2025-05-07T20:32:16.7834140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.7834428Z 2025-05-07T20:32:16.7834625Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.7834801Z 2025-05-07T20:32:16.7834906Z moe/activation_test.py:117: 2025-05-07T20:32:16.7835221Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.7835559Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.7835854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.7836569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.7837276Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.7837822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.7838519Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.7839197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.7839735Z kernel = self.compile( 2025-05-07T20:32:16.7840756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.7841446Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.7841865Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.7842100Z 2025-05-07T20:32:16.7842311Z self = 2025-05-07T20:32:16.7843402Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.7844781Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e93ec040>} 2025-05-07T20:32:16.7846133Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.7847169Z context = 2025-05-07T20:32:16.7847461Z 2025-05-07T20:32:16.7847633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.7848234Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.7848712Z module_map=module_map) 2025-05-07T20:32:16.7849084Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.7849450Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.7849723Z E ^ 2025-05-07T20:32:16.7850197Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.7850657Z 2025-05-07T20:32:16.7851076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.7851688Z 2025-05-07T20:32:16.7851798Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.7852227Z self=, 2025-05-07T20:32:16.7852634Z T=128, 2025-05-07T20:32:16.7852839Z D=5120, 2025-05-07T20:32:16.7853054Z scale_ub=None, 2025-05-07T20:32:16.7853281Z contiguous=True, 2025-05-07T20:32:16.7853519Z compiled=False, 2025-05-07T20:32:16.7853742Z ) 2025-05-07T20:32:16.7854076Z self = 2025-05-07T20:32:16.7854574Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.7854853Z 2025-05-07T20:32:16.7854936Z @given( 2025-05-07T20:32:16.7855185Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.7855507Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.7855831Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.7856186Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.7856521Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.7856819Z ) 2025-05-07T20:32:16.7857180Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.7857635Z def test_silu_mul_quant( 2025-05-07T20:32:16.7857886Z self, 2025-05-07T20:32:16.7858091Z T: int, 2025-05-07T20:32:16.7858298Z D: int, 2025-05-07T20:32:16.7858526Z scale_ub: Optional[float], 2025-05-07T20:32:16.7858810Z contiguous: bool, 2025-05-07T20:32:16.7859064Z compiled: bool, 2025-05-07T20:32:16.7859292Z ) -> None: 2025-05-07T20:32:16.7859520Z torch.manual_seed(2025) 2025-05-07T20:32:16.7859777Z 2025-05-07T20:32:16.7860055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.7860414Z 2025-05-07T20:32:16.7860621Z x_sign = torch.sign(x) 2025-05-07T20:32:16.7861004Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.7861411Z x = x_sign * x_clamp 2025-05-07T20:32:16.7861666Z x0 = x[:, :D] 2025-05-07T20:32:16.7861888Z x1 = x[:, D:] 2025-05-07T20:32:16.7862113Z 2025-05-07T20:32:16.7862314Z if contiguous: 2025-05-07T20:32:16.7862553Z x0 = x0.contiguous() 2025-05-07T20:32:16.7862831Z x1 = x1.contiguous() 2025-05-07T20:32:16.7863088Z 2025-05-07T20:32:16.7863295Z if scale_ub is not None: 2025-05-07T20:32:16.7863576Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.7863926Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.7864248Z ) 2025-05-07T20:32:16.7864448Z else: 2025-05-07T20:32:16.7864677Z scale_ub_tensor = None 2025-05-07T20:32:16.7864942Z 2025-05-07T20:32:16.7865182Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.7865513Z op = silu_mul_quant 2025-05-07T20:32:16.7865791Z if compiled: 2025-05-07T20:32:16.7866048Z op = torch.compile(op) 2025-05-07T20:32:16.7866363Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.7866653Z 2025-05-07T20:32:16.7866853Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.7867030Z 2025-05-07T20:32:16.7867134Z moe/activation_test.py:117: 2025-05-07T20:32:16.7867500Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.7867848Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.7868140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.7868843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.7869542Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.7870088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.7870786Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.7871513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.7872062Z kernel = self.compile( 2025-05-07T20:32:16.7872618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.7873287Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.7873697Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.7873932Z 2025-05-07T20:32:16.7874152Z self = 2025-05-07T20:32:16.7875238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.7876622Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e93ec9d0>} 2025-05-07T20:32:16.7877965Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.7878991Z context = 2025-05-07T20:32:16.7879288Z 2025-05-07T20:32:16.7879459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.7879995Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.7880470Z module_map=module_map) 2025-05-07T20:32:16.7880848Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.7881290Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.7881756Z E ^ 2025-05-07T20:32:16.7882315Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.7882803Z 2025-05-07T20:32:16.7883424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.7891538Z 2025-05-07T20:32:16.7891679Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.7892121Z self=, 2025-05-07T20:32:16.7892545Z T=128, 2025-05-07T20:32:16.7892748Z D=7168, 2025-05-07T20:32:16.7892947Z scale_ub=None, 2025-05-07T20:32:16.7893177Z contiguous=True, 2025-05-07T20:32:16.7893417Z compiled=False, 2025-05-07T20:32:16.7893631Z ) 2025-05-07T20:32:16.8794610Z self = 2025-05-07T20:32:16.8795184Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.8795495Z 2025-05-07T20:32:16.8795591Z @given( 2025-05-07T20:32:16.8795838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.8796176Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.8796505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.8797143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.8797485Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.8797796Z ) 2025-05-07T20:32:16.8798171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.8798626Z def test_silu_mul_quant( 2025-05-07T20:32:16.8798896Z self, 2025-05-07T20:32:16.8799116Z T: int, 2025-05-07T20:32:16.8799325Z D: int, 2025-05-07T20:32:16.8799568Z scale_ub: Optional[float], 2025-05-07T20:32:16.8799862Z contiguous: bool, 2025-05-07T20:32:16.8800113Z compiled: bool, 2025-05-07T20:32:16.8800361Z ) -> None: 2025-05-07T20:32:16.8800702Z torch.manual_seed(2025) 2025-05-07T20:32:16.8800955Z 2025-05-07T20:32:16.8801255Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.8801622Z 2025-05-07T20:32:16.8801840Z x_sign = torch.sign(x) 2025-05-07T20:32:16.8802146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.8802487Z x = x_sign * x_clamp 2025-05-07T20:32:16.8802755Z x0 = x[:, :D] 2025-05-07T20:32:16.8802983Z x1 = x[:, D:] 2025-05-07T20:32:16.8803209Z 2025-05-07T20:32:16.8803417Z if contiguous: 2025-05-07T20:32:16.8803660Z x0 = x0.contiguous() 2025-05-07T20:32:16.8803943Z x1 = x1.contiguous() 2025-05-07T20:32:16.8804203Z 2025-05-07T20:32:16.8804406Z if scale_ub is not None: 2025-05-07T20:32:16.8804706Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.8805063Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.8805390Z ) 2025-05-07T20:32:16.8805607Z else: 2025-05-07T20:32:16.8805836Z scale_ub_tensor = None 2025-05-07T20:32:16.8806098Z 2025-05-07T20:32:16.8806349Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.8806689Z op = silu_mul_quant 2025-05-07T20:32:16.8806964Z if compiled: 2025-05-07T20:32:16.8807226Z op = torch.compile(op) 2025-05-07T20:32:16.8807549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.8807849Z 2025-05-07T20:32:16.8808055Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.8808241Z 2025-05-07T20:32:16.8808348Z moe/activation_test.py:117: 2025-05-07T20:32:16.8808672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.8809017Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.8809328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.8810191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.8810913Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.8811468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.8812179Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.8812874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.8813423Z kernel = self.compile( 2025-05-07T20:32:16.8813988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.8814662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.8815081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.8815321Z 2025-05-07T20:32:16.8815545Z self = 2025-05-07T20:32:16.8816643Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.8818093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e94f1430>} 2025-05-07T20:32:16.8819451Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.8820497Z context = 2025-05-07T20:32:16.8820805Z 2025-05-07T20:32:16.8820980Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.8821623Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.8822165Z module_map=module_map) 2025-05-07T20:32:16.8822545Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.8822942Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.8823259Z E ^ 2025-05-07T20:32:16.8823738Z E ValueError("type fp8e4nv not supported in this architecture. 
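Every traceback funnels through the same launch path: subscripting a @triton.jit function with a grid, as activation.py does with _fbgemm_silu_mul_quant[grid](...), returns a launcher whose first call compiles the kernel, which is where make_ir raises here. The pattern itself, shown with a toy kernel rather than the FBGEMM one:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _toy_double(x_ptr, n_elements, BLOCK: tl.constexpr):
        # Compilation happens on the first launch, inside kernel[grid](...).
        offs = tl.program_id(axis=0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        tl.store(x_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2, mask=mask)

    x = torch.ones(1024, device="cuda")
    grid = (triton.cdiv(x.numel(), 256),)
    _toy_double[grid](x, x.numel(), BLOCK=256)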
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.8824210Z 2025-05-07T20:32:16.8824632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.8825159Z 2025-05-07T20:32:16.8825270Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.8825704Z self=, 2025-05-07T20:32:16.8826118Z T=2048, 2025-05-07T20:32:16.8826327Z D=7168, 2025-05-07T20:32:16.8826536Z scale_ub=1200.0, 2025-05-07T20:32:16.8826780Z contiguous=True, 2025-05-07T20:32:16.8827026Z compiled=False, 2025-05-07T20:32:16.8827256Z ) 2025-05-07T20:32:16.8827588Z self = 2025-05-07T20:32:16.8828103Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.8828397Z 2025-05-07T20:32:16.8828481Z @given( 2025-05-07T20:32:16.8828726Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.8829050Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.8829380Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.8829728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.8830068Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.8830374Z ) 2025-05-07T20:32:16.8830741Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.8831200Z def test_silu_mul_quant( 2025-05-07T20:32:16.8831585Z self, 2025-05-07T20:32:16.8831802Z T: int, 2025-05-07T20:32:16.8832022Z D: int, 2025-05-07T20:32:16.8832252Z scale_ub: Optional[float], 2025-05-07T20:32:16.8832542Z contiguous: bool, 2025-05-07T20:32:16.8832800Z compiled: bool, 2025-05-07T20:32:16.8833037Z ) -> None: 2025-05-07T20:32:16.8833275Z torch.manual_seed(2025) 2025-05-07T20:32:16.8833535Z 2025-05-07T20:32:16.8833818Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.8835876Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.8837750Z 2025-05-07T20:32:16.8837873Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.8838097Z 2025-05-07T20:32:16.8838204Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.8838680Z self=, 2025-05-07T20:32:16.8839085Z T=1, 2025-05-07T20:32:16.8839282Z D=5120, 2025-05-07T20:32:16.8839487Z scale_ub=1200.0, 2025-05-07T20:32:16.8839717Z contiguous=True, 2025-05-07T20:32:16.8839954Z compiled=False, 2025-05-07T20:32:16.8840456Z ) 2025-05-07T20:32:16.9329157Z self = 2025-05-07T20:32:16.9329712Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.9329980Z 2025-05-07T20:32:16.9330064Z @given( 2025-05-07T20:32:16.9330308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.9330833Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.9331153Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.9331489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.9331830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.9332134Z ) 2025-05-07T20:32:16.9332488Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.9332945Z def test_silu_mul_quant( 2025-05-07T20:32:16.9333200Z self, 2025-05-07T20:32:16.9333402Z T: int, 2025-05-07T20:32:16.9333612Z D: int, 2025-05-07T20:32:16.9333847Z scale_ub: Optional[float], 2025-05-07T20:32:16.9334123Z contiguous: bool, 2025-05-07T20:32:16.9334376Z compiled: bool, 2025-05-07T20:32:16.9334616Z ) -> None: 2025-05-07T20:32:16.9334837Z torch.manual_seed(2025) 2025-05-07T20:32:16.9335094Z 2025-05-07T20:32:16.9335386Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.9335746Z 2025-05-07T20:32:16.9335943Z x_sign = torch.sign(x) 2025-05-07T20:32:16.9336247Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.9336576Z x = x_sign * x_clamp 2025-05-07T20:32:16.9336829Z x0 = x[:, :D] 2025-05-07T20:32:16.9337064Z x1 = x[:, D:] 2025-05-07T20:32:16.9337285Z 2025-05-07T20:32:16.9337478Z if contiguous: 2025-05-07T20:32:16.9337723Z x0 = x0.contiguous() 2025-05-07T20:32:16.9337996Z x1 = x1.contiguous() 2025-05-07T20:32:16.9338245Z 2025-05-07T20:32:16.9338450Z if scale_ub is not None: 2025-05-07T20:32:16.9338740Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.9339083Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.9339410Z ) 2025-05-07T20:32:16.9339621Z else: 2025-05-07T20:32:16.9339841Z scale_ub_tensor = None 2025-05-07T20:32:16.9340514Z 2025-05-07T20:32:16.9340769Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.9341204Z op = silu_mul_quant 2025-05-07T20:32:16.9341470Z if compiled: 2025-05-07T20:32:16.9341737Z op = torch.compile(op) 2025-05-07T20:32:16.9342051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.9342338Z 2025-05-07T20:32:16.9342548Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.9342721Z 2025-05-07T20:32:16.9342835Z moe/activation_test.py:117: 2025-05-07T20:32:16.9343139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.9343490Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.9343788Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.9344483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.9345189Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.9345747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.9346440Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.9347112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.9347737Z kernel = self.compile( 2025-05-07T20:32:16.9348292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.9348959Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.9349363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.9349606Z 2025-05-07T20:32:16.9349820Z self = 2025-05-07T20:32:16.9350911Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.9352367Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9412160>} 2025-05-07T20:32:16.9353789Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.9354815Z context = 2025-05-07T20:32:16.9355111Z 2025-05-07T20:32:16.9355293Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.9355822Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.9356318Z module_map=module_map) 2025-05-07T20:32:16.9356704Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.9357073Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.9357342Z E ^ 2025-05-07T20:32:16.9357818Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.9358273Z 2025-05-07T20:32:16.9358700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.9359226Z 2025-05-07T20:32:16.9359340Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.9359759Z self=, 2025-05-07T20:32:16.9360174Z T=2048, 2025-05-07T20:32:16.9360379Z D=5120, 2025-05-07T20:32:16.9360580Z scale_ub=None, 2025-05-07T20:32:16.9360808Z contiguous=True, 2025-05-07T20:32:16.9361046Z compiled=False, 2025-05-07T20:32:16.9361268Z ) 2025-05-07T20:32:16.9361720Z self = 2025-05-07T20:32:16.9362230Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.9362505Z 2025-05-07T20:32:16.9362588Z @given( 2025-05-07T20:32:16.9362838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.9363169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.9363491Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.9363827Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.9364169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.9364468Z ) 2025-05-07T20:32:16.9364825Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.9365281Z def test_silu_mul_quant( 2025-05-07T20:32:16.9365539Z self, 2025-05-07T20:32:16.9365739Z T: int, 2025-05-07T20:32:16.9365949Z D: int, 2025-05-07T20:32:16.9366192Z scale_ub: Optional[float], 2025-05-07T20:32:16.9366471Z contiguous: bool, 2025-05-07T20:32:16.9366729Z compiled: bool, 2025-05-07T20:32:16.9366965Z ) -> None: 2025-05-07T20:32:16.9367188Z torch.manual_seed(2025) 2025-05-07T20:32:16.9367445Z 2025-05-07T20:32:16.9367729Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.9368144Z 2025-05-07T20:32:16.9368345Z > x_sign = torch.sign(x) 2025-05-07T20:32:16.9370292Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
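With verbosity=Verbosity.verbose and deadline=None, Hypothesis keeps drawing new examples after each failure, which is why this log interleaves so many parameter sets. To replay one failing draw deterministically while debugging, the arguments can be pinned with hypothesis.example next to the existing @given, e.g. the T=2048, D=5120 case above (stand-in test body shown):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=2048, D=5120)  # explicit examples run before random draws
    @settings(deadline=None)
    def test_replay(T: int, D: int) -> None:
        assert T * D > 0  # stand-in for the real silu_mul_quant checks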
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.9372208Z 2025-05-07T20:32:16.9372332Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.9372552Z 2025-05-07T20:32:16.9372667Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.9373088Z self=, 2025-05-07T20:32:16.9373509Z T=16384, 2025-05-07T20:32:16.9373716Z D=5120, 2025-05-07T20:32:16.9373921Z scale_ub=None, 2025-05-07T20:32:16.9374149Z contiguous=True, 2025-05-07T20:32:16.9374388Z compiled=False, 2025-05-07T20:32:16.9374606Z ) 2025-05-07T20:32:16.9374935Z self = 2025-05-07T20:32:16.9375445Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.9375723Z 2025-05-07T20:32:16.9375812Z @given( 2025-05-07T20:32:16.9376051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.9376390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.9376711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.9377057Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.9377407Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.9377714Z ) 2025-05-07T20:32:16.9378080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.9378530Z def test_silu_mul_quant( 2025-05-07T20:32:16.9378786Z self, 2025-05-07T20:32:16.9378999Z T: int, 2025-05-07T20:32:16.9379205Z D: int, 2025-05-07T20:32:16.9379439Z scale_ub: Optional[float], 2025-05-07T20:32:16.9379724Z contiguous: bool, 2025-05-07T20:32:16.9379970Z compiled: bool, 2025-05-07T20:32:16.9380206Z ) -> None: 2025-05-07T20:32:16.9380433Z torch.manual_seed(2025) 2025-05-07T20:32:16.9380682Z 2025-05-07T20:32:16.9380965Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.9383271Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.9385154Z 2025-05-07T20:32:16.9385280Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.9385500Z 2025-05-07T20:32:16.9385616Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.9386033Z self=, 2025-05-07T20:32:16.9386445Z T=4096, 2025-05-07T20:32:16.9386643Z D=5120, 2025-05-07T20:32:16.9386850Z scale_ub=None, 2025-05-07T20:32:16.9387079Z contiguous=True, 2025-05-07T20:32:16.9387316Z compiled=False, 2025-05-07T20:32:16.9387528Z ) 2025-05-07T20:32:17.0425111Z self = 2025-05-07T20:32:17.0425775Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.0426267Z 2025-05-07T20:32:17.0426353Z @given( 2025-05-07T20:32:17.0426601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0426922Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0427247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0427595Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0427944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0428236Z ) 2025-05-07T20:32:17.0428598Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0429060Z def test_silu_mul_quant( 2025-05-07T20:32:17.0429427Z self, 2025-05-07T20:32:17.0429643Z T: int, 2025-05-07T20:32:17.0429858Z D: int, 2025-05-07T20:32:17.0430086Z scale_ub: Optional[float], 2025-05-07T20:32:17.0430379Z contiguous: bool, 2025-05-07T20:32:17.0430635Z compiled: bool, 2025-05-07T20:32:17.0430875Z ) -> None: 2025-05-07T20:32:17.0431106Z torch.manual_seed(2025) 2025-05-07T20:32:17.0431367Z 2025-05-07T20:32:17.0431647Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0433720Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.0435603Z 2025-05-07T20:32:17.0435729Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.0435955Z 2025-05-07T20:32:17.0436063Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0436493Z self=, 2025-05-07T20:32:17.0436900Z T=2048, 2025-05-07T20:32:17.0437102Z D=5120, 2025-05-07T20:32:17.0437311Z scale_ub=None, 2025-05-07T20:32:17.0437536Z contiguous=False, 2025-05-07T20:32:17.0437782Z compiled=False, 2025-05-07T20:32:17.0438002Z ) 2025-05-07T20:32:17.0438323Z self = 2025-05-07T20:32:17.0438831Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.0439118Z 2025-05-07T20:32:17.0439202Z @given( 2025-05-07T20:32:17.0439579Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0439905Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0440513Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0440858Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0441195Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0441497Z ) 2025-05-07T20:32:17.0441865Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0442317Z def test_silu_mul_quant( 2025-05-07T20:32:17.0442578Z self, 2025-05-07T20:32:17.0442788Z T: int, 2025-05-07T20:32:17.0443000Z D: int, 2025-05-07T20:32:17.0443227Z scale_ub: Optional[float], 2025-05-07T20:32:17.0443511Z contiguous: bool, 2025-05-07T20:32:17.0443763Z compiled: bool, 2025-05-07T20:32:17.0443993Z ) -> None: 2025-05-07T20:32:17.0444222Z torch.manual_seed(2025) 2025-05-07T20:32:17.0444477Z 2025-05-07T20:32:17.0444767Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0446784Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
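Note that each of these examples dies on its very first CUDA allocation while the device already reports 21.73 GiB allocated by PyTorch, so the 40-320 MiB requests themselves are not the problem; memory held over from earlier examples or tests is. Two mitigations, sketched under the assumption that the test process owns its environment: the allocator option is the one named in the error message above, and empty_cache() only releases cached blocks that no live tensor references, so lingering example tensors must still be dropped first.

    import os

    # Takes effect only if set before the CUDA caching allocator initializes,
    # i.e. before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cached_gpu_memory() -> None:
        # Return cached, unreferenced blocks to the driver, e.g. from a
        # setUp()/tearDown() hook between Hypothesis examples.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()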
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.0448737Z 2025-05-07T20:32:17.0448862Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.0449088Z 2025-05-07T20:32:17.0449195Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0449621Z self=, 2025-05-07T20:32:17.0450099Z T=4096, 2025-05-07T20:32:17.0450303Z D=7168, 2025-05-07T20:32:17.0450510Z scale_ub=None, 2025-05-07T20:32:17.0450735Z contiguous=True, 2025-05-07T20:32:17.0450975Z compiled=True, 2025-05-07T20:32:17.0451199Z ) 2025-05-07T20:32:17.0451528Z self = 2025-05-07T20:32:17.0452034Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.0452311Z 2025-05-07T20:32:17.0452407Z @given( 2025-05-07T20:32:17.0452662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0453031Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0453357Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0453702Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0454042Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0454346Z ) 2025-05-07T20:32:17.0454716Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0455170Z def test_silu_mul_quant( 2025-05-07T20:32:17.0455432Z self, 2025-05-07T20:32:17.0455641Z T: int, 2025-05-07T20:32:17.0455849Z D: int, 2025-05-07T20:32:17.0456084Z scale_ub: Optional[float], 2025-05-07T20:32:17.0456371Z contiguous: bool, 2025-05-07T20:32:17.0456623Z compiled: bool, 2025-05-07T20:32:17.0456860Z ) -> None: 2025-05-07T20:32:17.0457089Z torch.manual_seed(2025) 2025-05-07T20:32:17.0457340Z 2025-05-07T20:32:17.0457623Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0459765Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.0461744Z 2025-05-07T20:32:17.0461870Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.0462092Z 2025-05-07T20:32:17.0462208Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0462627Z self=, 2025-05-07T20:32:17.0463046Z T=2048, 2025-05-07T20:32:17.0463249Z D=5120, 2025-05-07T20:32:17.0463452Z scale_ub=1200.0, 2025-05-07T20:32:17.0463690Z contiguous=False, 2025-05-07T20:32:17.0463931Z compiled=False, 2025-05-07T20:32:17.0464144Z ) 2025-05-07T20:32:17.0464474Z self = 2025-05-07T20:32:17.0464985Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.0465270Z 2025-05-07T20:32:17.0465365Z @given( 2025-05-07T20:32:17.0465602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0465934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0466256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0466595Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0466996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0467296Z ) 2025-05-07T20:32:17.0467652Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0468107Z def test_silu_mul_quant( 2025-05-07T20:32:17.0468378Z self, 2025-05-07T20:32:17.0468589Z T: int, 2025-05-07T20:32:17.0468795Z D: int, 2025-05-07T20:32:17.0469029Z scale_ub: Optional[float], 2025-05-07T20:32:17.0469449Z contiguous: bool, 2025-05-07T20:32:17.0469743Z compiled: bool, 2025-05-07T20:32:17.0478544Z ) -> None: 2025-05-07T20:32:17.0478881Z torch.manual_seed(2025) 2025-05-07T20:32:17.0479138Z 2025-05-07T20:32:17.0479427Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0481468Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.0483334Z 2025-05-07T20:32:17.0483457Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.0483680Z 2025-05-07T20:32:17.0483794Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0484224Z self=, 2025-05-07T20:32:17.0484650Z T=4096, 2025-05-07T20:32:17.0484856Z D=7168, 2025-05-07T20:32:17.0485059Z scale_ub=1200.0, 2025-05-07T20:32:17.0485303Z contiguous=True, 2025-05-07T20:32:17.0485546Z compiled=False, 2025-05-07T20:32:17.0485761Z ) 2025-05-07T20:32:17.0486106Z self = 2025-05-07T20:32:17.0486626Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.0486904Z 2025-05-07T20:32:17.0487000Z @given( 2025-05-07T20:32:17.0487239Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0487577Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0487904Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0488247Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0488598Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0488906Z ) 2025-05-07T20:32:17.0489350Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0489813Z def test_silu_mul_quant( 2025-05-07T20:32:17.0490073Z self, 2025-05-07T20:32:17.0490276Z T: int, 2025-05-07T20:32:17.0490490Z D: int, 2025-05-07T20:32:17.0490730Z scale_ub: Optional[float], 2025-05-07T20:32:17.0491009Z contiguous: bool, 2025-05-07T20:32:17.0491266Z compiled: bool, 2025-05-07T20:32:17.0491505Z ) -> None: 2025-05-07T20:32:17.0491741Z torch.manual_seed(2025) 2025-05-07T20:32:17.0491991Z 2025-05-07T20:32:17.0492278Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0494367Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.0496285Z 2025-05-07T20:32:17.0496418Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.0496639Z 2025-05-07T20:32:17.0496748Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0497179Z self=, 2025-05-07T20:32:17.0497603Z T=16384, 2025-05-07T20:32:17.0497816Z D=7168, 2025-05-07T20:32:17.0498015Z scale_ub=None, 2025-05-07T20:32:17.0498248Z contiguous=False, 2025-05-07T20:32:17.0498491Z compiled=True, 2025-05-07T20:32:17.0498701Z ) 2025-05-07T20:32:17.3865608Z self = 2025-05-07T20:32:17.3866223Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.3866848Z 2025-05-07T20:32:17.3866932Z @given( 2025-05-07T20:32:17.3867182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.3867503Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.3867837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.3868198Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.3868546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.3868845Z ) 2025-05-07T20:32:17.3869212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.3869670Z def test_silu_mul_quant( 2025-05-07T20:32:17.3869923Z self, 2025-05-07T20:32:17.3870134Z T: int, 2025-05-07T20:32:17.3870350Z D: int, 2025-05-07T20:32:17.3870580Z scale_ub: Optional[float], 2025-05-07T20:32:17.3870871Z contiguous: bool, 2025-05-07T20:32:17.3871133Z compiled: bool, 2025-05-07T20:32:17.3871381Z ) -> None: 2025-05-07T20:32:17.3871611Z torch.manual_seed(2025) 2025-05-07T20:32:17.3871869Z 2025-05-07T20:32:17.3872149Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.3874237Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.3876100Z 2025-05-07T20:32:17.3876225Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.3876451Z 2025-05-07T20:32:17.3876719Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.3877152Z self=, 2025-05-07T20:32:17.3877567Z T=4096, 2025-05-07T20:32:17.3877769Z D=7168, 2025-05-07T20:32:17.3877976Z scale_ub=None, 2025-05-07T20:32:17.3878196Z contiguous=True, 2025-05-07T20:32:17.3878436Z compiled=False, 2025-05-07T20:32:17.3878660Z ) 2025-05-07T20:32:17.3878986Z self = 2025-05-07T20:32:17.3879496Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.3879780Z 2025-05-07T20:32:17.3879864Z @given( 2025-05-07T20:32:17.3880104Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.3880425Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.3880749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.3881094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.3881441Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.3881739Z ) 2025-05-07T20:32:17.3882101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.3882547Z def test_silu_mul_quant( 2025-05-07T20:32:17.3882800Z self, 2025-05-07T20:32:17.3883008Z T: int, 2025-05-07T20:32:17.3883294Z D: int, 2025-05-07T20:32:17.3883524Z scale_ub: Optional[float], 2025-05-07T20:32:17.3883812Z contiguous: bool, 2025-05-07T20:32:17.3884067Z compiled: bool, 2025-05-07T20:32:17.3884295Z ) -> None: 2025-05-07T20:32:17.3884522Z torch.manual_seed(2025) 2025-05-07T20:32:17.3884777Z 2025-05-07T20:32:17.3885049Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.3887079Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
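The "Tried to allocate" sizes match the test's input tensor exactly: x has shape [T, 2 * D] in bfloat16, i.e. 2 bytes per element, so each request is T * 2D * 2 bytes. A quick check against the messages above:

    # bfloat16 is 2 bytes/element; x = torch.randn([T, 2 * D], dtype=torch.bfloat16)
    def expected_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / (1024 ** 2)

    assert expected_mib(2048, 5120) == 40.0    # "Tried to allocate 40.00 MiB"
    assert expected_mib(4096, 5120) == 80.0    # 80.00 MiB
    assert expected_mib(16384, 5120) == 320.0  # 320.00 MiB
    assert expected_mib(4096, 7168) == 112.0   # 112.00 MiB
    assert expected_mib(16384, 7168) == 448.0  # 448.00 MiB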
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.3889000Z 2025-05-07T20:32:17.3889123Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.3889347Z 2025-05-07T20:32:17.3889454Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.3889879Z self=, 2025-05-07T20:32:17.3890283Z T=16384, 2025-05-07T20:32:17.3890495Z D=7168, 2025-05-07T20:32:17.3890697Z scale_ub=None, 2025-05-07T20:32:17.3890915Z contiguous=True, 2025-05-07T20:32:17.3891157Z compiled=False, 2025-05-07T20:32:17.3891374Z ) 2025-05-07T20:32:17.3891697Z self = 2025-05-07T20:32:17.3892203Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.3892476Z 2025-05-07T20:32:17.3892563Z @given( 2025-05-07T20:32:17.3892792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.3893128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.3893484Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.3893823Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.3894156Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.3894451Z ) 2025-05-07T20:32:17.3894810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.3895252Z def test_silu_mul_quant( 2025-05-07T20:32:17.3895502Z self, 2025-05-07T20:32:17.3895710Z T: int, 2025-05-07T20:32:17.3895910Z D: int, 2025-05-07T20:32:17.3896138Z scale_ub: Optional[float], 2025-05-07T20:32:17.3896502Z contiguous: bool, 2025-05-07T20:32:17.3896747Z compiled: bool, 2025-05-07T20:32:17.3896981Z ) -> None: 2025-05-07T20:32:17.3897208Z torch.manual_seed(2025) 2025-05-07T20:32:17.3897454Z 2025-05-07T20:32:17.3897733Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.3899762Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.3901707Z 2025-05-07T20:32:17.3901830Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.3902056Z 2025-05-07T20:32:17.3902170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.3902586Z self=, 2025-05-07T20:32:17.3902995Z T=16384, 2025-05-07T20:32:17.3903200Z D=7168, 2025-05-07T20:32:17.3903446Z scale_ub=1200.0, 2025-05-07T20:32:17.3903682Z contiguous=True, 2025-05-07T20:32:17.3903911Z compiled=False, 2025-05-07T20:32:17.3904118Z ) 2025-05-07T20:32:17.3904445Z self = 2025-05-07T20:32:17.3904951Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.3905228Z 2025-05-07T20:32:17.3905318Z @given( 2025-05-07T20:32:17.3905548Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.3905872Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.3906190Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.3906573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.3906912Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.3907211Z ) 2025-05-07T20:32:17.3907563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.3908016Z def test_silu_mul_quant( 2025-05-07T20:32:17.3908275Z self, 2025-05-07T20:32:17.3908480Z T: int, 2025-05-07T20:32:17.3908680Z D: int, 2025-05-07T20:32:17.3908912Z scale_ub: Optional[float], 2025-05-07T20:32:17.3909195Z contiguous: bool, 2025-05-07T20:32:17.3909435Z compiled: bool, 2025-05-07T20:32:17.3909665Z ) -> None: 2025-05-07T20:32:17.3909894Z torch.manual_seed(2025) 2025-05-07T20:32:17.3910141Z 2025-05-07T20:32:17.3910419Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.3912493Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.3914331Z 2025-05-07T20:32:17.3914460Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.3914676Z 2025-05-07T20:32:17.3914788Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.3915205Z self=, 2025-05-07T20:32:17.3915613Z T=128, 2025-05-07T20:32:17.3915808Z D=5120, 2025-05-07T20:32:17.3916000Z scale_ub=1200.0, 2025-05-07T20:32:17.3916252Z contiguous=False, 2025-05-07T20:32:17.3916485Z compiled=False, 2025-05-07T20:32:17.3916710Z ) 2025-05-07T20:32:17.5553035Z self = 2025-05-07T20:32:17.5553598Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.5553877Z 2025-05-07T20:32:17.5553964Z @given( 2025-05-07T20:32:17.5554208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5554542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5554864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5555203Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5555546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5555854Z ) 2025-05-07T20:32:17.5556205Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5556661Z def test_silu_mul_quant( 2025-05-07T20:32:17.5556920Z self, 2025-05-07T20:32:17.5557121Z T: int, 2025-05-07T20:32:17.5557337Z D: int, 2025-05-07T20:32:17.5557586Z scale_ub: Optional[float], 2025-05-07T20:32:17.5557867Z contiguous: bool, 2025-05-07T20:32:17.5558117Z compiled: bool, 2025-05-07T20:32:17.5558359Z ) -> None: 2025-05-07T20:32:17.5558579Z torch.manual_seed(2025) 2025-05-07T20:32:17.5558832Z 2025-05-07T20:32:17.5559186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5559539Z 2025-05-07T20:32:17.5559736Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5560040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5560364Z x = x_sign * x_clamp 2025-05-07T20:32:17.5560614Z x0 = x[:, :D] 2025-05-07T20:32:17.5560843Z x1 = x[:, D:] 2025-05-07T20:32:17.5561072Z 2025-05-07T20:32:17.5561262Z if contiguous: 2025-05-07T20:32:17.5561509Z x0 = x0.contiguous() 2025-05-07T20:32:17.5561784Z x1 = x1.contiguous() 2025-05-07T20:32:17.5562034Z 2025-05-07T20:32:17.5562327Z if scale_ub is not None: 2025-05-07T20:32:17.5562620Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5563001Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5563322Z ) 2025-05-07T20:32:17.5563528Z else: 2025-05-07T20:32:17.5563740Z scale_ub_tensor = None 2025-05-07T20:32:17.5564006Z 2025-05-07T20:32:17.5564249Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5564576Z op = silu_mul_quant 2025-05-07T20:32:17.5564834Z if compiled: 2025-05-07T20:32:17.5565093Z op = torch.compile(op) 2025-05-07T20:32:17.5565419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5565706Z 2025-05-07T20:32:17.5565901Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5566081Z 2025-05-07T20:32:17.5566186Z moe/activation_test.py:117: 2025-05-07T20:32:17.5566492Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5566844Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5567135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5567837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5568543Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5569086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5569781Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5570460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5570994Z kernel = self.compile( 2025-05-07T20:32:17.5571549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5572298Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5572717Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5572952Z 2025-05-07T20:32:17.5573163Z self = 2025-05-07T20:32:17.5574245Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5575646Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9301ca0>} 2025-05-07T20:32:17.5576985Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5578004Z context = 2025-05-07T20:32:17.5578304Z 2025-05-07T20:32:17.5578475Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5579006Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5579584Z module_map=module_map) 2025-05-07T20:32:17.5579952Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5580323Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5580591Z E ^ 2025-05-07T20:32:17.5581053Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5581583Z 2025-05-07T20:32:17.5582009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5582528Z 2025-05-07T20:32:17.5582634Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5583115Z self=, 2025-05-07T20:32:17.5583519Z T=2048, 2025-05-07T20:32:17.5583716Z D=7168, 2025-05-07T20:32:17.5583921Z scale_ub=None, 2025-05-07T20:32:17.5584142Z contiguous=False, 2025-05-07T20:32:17.5584379Z compiled=False, 2025-05-07T20:32:17.5584603Z ) 2025-05-07T20:32:17.5584922Z self = 2025-05-07T20:32:17.5585427Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.5585709Z 2025-05-07T20:32:17.5585791Z @given( 2025-05-07T20:32:17.5586031Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5586348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5586667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5587008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5587338Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5587646Z ) 2025-05-07T20:32:17.5588003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5588453Z def test_silu_mul_quant( 2025-05-07T20:32:17.5588701Z self, 2025-05-07T20:32:17.5588904Z T: int, 2025-05-07T20:32:17.5589112Z D: int, 2025-05-07T20:32:17.5589335Z scale_ub: Optional[float], 2025-05-07T20:32:17.5589617Z contiguous: bool, 2025-05-07T20:32:17.5589869Z compiled: bool, 2025-05-07T20:32:17.5590094Z ) -> None: 2025-05-07T20:32:17.5590319Z torch.manual_seed(2025) 2025-05-07T20:32:17.5590574Z 2025-05-07T20:32:17.5590849Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5592986Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.5594832Z 2025-05-07T20:32:17.5594955Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.5595181Z 2025-05-07T20:32:17.5595286Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5595711Z self=, 2025-05-07T20:32:17.5596113Z T=128, 2025-05-07T20:32:17.5596313Z D=7168, 2025-05-07T20:32:17.5596519Z scale_ub=1200.0, 2025-05-07T20:32:17.5596746Z contiguous=True, 2025-05-07T20:32:17.5596979Z compiled=True, 2025-05-07T20:32:17.5597189Z ) 2025-05-07T20:32:17.6052224Z self = 2025-05-07T20:32:17.6052759Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.6053063Z 2025-05-07T20:32:17.6053144Z @given( 2025-05-07T20:32:17.6053378Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.6053696Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.6054179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.6054522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.6054860Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.6055152Z ) 2025-05-07T20:32:17.6055504Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.6055954Z def test_silu_mul_quant( 2025-05-07T20:32:17.6056202Z self, 2025-05-07T20:32:17.6056398Z T: int, 2025-05-07T20:32:17.6056602Z D: int, 2025-05-07T20:32:17.6056830Z scale_ub: Optional[float], 2025-05-07T20:32:17.6057106Z contiguous: bool, 2025-05-07T20:32:17.6057439Z compiled: bool, 2025-05-07T20:32:17.6057674Z ) -> None: 2025-05-07T20:32:17.6057892Z torch.manual_seed(2025) 2025-05-07T20:32:17.6058143Z 2025-05-07T20:32:17.6058421Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.6058765Z 2025-05-07T20:32:17.6058971Z x_sign = torch.sign(x) 2025-05-07T20:32:17.6059270Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.6059591Z x = x_sign * x_clamp 2025-05-07T20:32:17.6059835Z x0 = x[:, :D] 2025-05-07T20:32:17.6060060Z x1 = x[:, D:] 2025-05-07T20:32:17.6060277Z 2025-05-07T20:32:17.6060466Z if contiguous: 2025-05-07T20:32:17.6060707Z x0 = x0.contiguous() 2025-05-07T20:32:17.6060979Z x1 = x1.contiguous() 2025-05-07T20:32:17.6061359Z 2025-05-07T20:32:17.6061567Z if scale_ub is not None: 2025-05-07T20:32:17.6061851Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.6062203Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.6062527Z ) 2025-05-07T20:32:17.6062729Z else: 2025-05-07T20:32:17.6062942Z scale_ub_tensor = None 2025-05-07T20:32:17.6063206Z 2025-05-07T20:32:17.6063447Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.6063773Z op = silu_mul_quant 2025-05-07T20:32:17.6064035Z if compiled: 2025-05-07T20:32:17.6064294Z op = torch.compile(op) 2025-05-07T20:32:17.6064594Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.6064880Z 2025-05-07T20:32:17.6065081Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.6065249Z 2025-05-07T20:32:17.6065358Z moe/activation_test.py:117: 2025-05-07T20:32:17.6065653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.6065994Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.6066283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.6066982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.6067563Z return fn(*args, **kwargs) 2025-05-07T20:32:17.6068223Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.6068917Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.6069461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.6070150Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.6070819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.6071350Z kernel = self.compile( 2025-05-07T20:32:17.6071900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.6072577Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.6073032Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.6073264Z 2025-05-07T20:32:17.6073477Z self = 2025-05-07T20:32:17.6074604Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.6075981Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e95ae280>} 2025-05-07T20:32:17.6077325Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.6078387Z context = 2025-05-07T20:32:17.6078679Z 2025-05-07T20:32:17.6078851Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.6079379Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.6079851Z module_map=module_map) 2025-05-07T20:32:17.6080218Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.6080580Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.6080846Z E ^ 2025-05-07T20:32:17.6081335Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.6081791Z 2025-05-07T20:32:17.6082207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.6082735Z 2025-05-07T20:32:17.6082852Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.6083278Z self=, 2025-05-07T20:32:17.6083689Z T=128, 2025-05-07T20:32:17.6083959Z D=7168, 2025-05-07T20:32:17.6084265Z scale_ub=1200.0, 2025-05-07T20:32:17.6084765Z contiguous=True, 2025-05-07T20:32:17.6085105Z compiled=False, 2025-05-07T20:32:17.6093967Z ) 2025-05-07T20:32:17.6094325Z self = 2025-05-07T20:32:17.6094830Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.6095115Z 2025-05-07T20:32:17.6095198Z @given( 2025-05-07T20:32:17.6095446Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.6095774Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.6096089Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.6096433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.6096911Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.6097212Z ) 2025-05-07T20:32:17.6097582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.6098039Z def test_silu_mul_quant( 2025-05-07T20:32:17.6098288Z self, 2025-05-07T20:32:17.6098498Z T: int, 2025-05-07T20:32:17.6098708Z D: int, 2025-05-07T20:32:17.6098932Z scale_ub: Optional[float], 2025-05-07T20:32:17.6099219Z contiguous: bool, 2025-05-07T20:32:17.6099473Z compiled: bool, 2025-05-07T20:32:17.6099707Z ) -> None: 2025-05-07T20:32:17.6099930Z torch.manual_seed(2025) 2025-05-07T20:32:17.6100186Z 2025-05-07T20:32:17.6100470Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.6100817Z 2025-05-07T20:32:17.6101019Z x_sign = torch.sign(x) 2025-05-07T20:32:17.6101407Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.6103486Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.6105412Z 2025-05-07T20:32:17.6105536Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:17.6105761Z 2025-05-07T20:32:17.6105868Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.6106292Z self=, 2025-05-07T20:32:17.6106705Z T=128, 2025-05-07T20:32:17.6106896Z D=5120, 2025-05-07T20:32:17.6107099Z scale_ub=1200.0, 2025-05-07T20:32:17.6107387Z contiguous=True, 2025-05-07T20:32:17.6107615Z compiled=True, 2025-05-07T20:32:17.6107833Z ) 2025-05-07T20:32:17.6108163Z self = 2025-05-07T20:32:17.6108663Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.6108944Z 2025-05-07T20:32:17.6109028Z @given( 2025-05-07T20:32:17.6109269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.6109589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.6109910Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.6110253Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.6110598Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.6110892Z ) 2025-05-07T20:32:17.6111254Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.6111705Z def test_silu_mul_quant( 2025-05-07T20:32:17.6111949Z self, 2025-05-07T20:32:17.6112153Z T: int, 2025-05-07T20:32:17.6112365Z D: int, 2025-05-07T20:32:17.6112590Z scale_ub: Optional[float], 2025-05-07T20:32:17.6112874Z contiguous: bool, 2025-05-07T20:32:17.6113129Z compiled: bool, 2025-05-07T20:32:17.6113358Z ) -> None: 2025-05-07T20:32:17.6113592Z torch.manual_seed(2025) 2025-05-07T20:32:17.6113850Z 2025-05-07T20:32:17.6114128Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.6114486Z 2025-05-07T20:32:17.6114691Z x_sign = torch.sign(x) 2025-05-07T20:32:17.6114978Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.6117073Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
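For orientation, the operation under test (its reference appears in the ref_fn body further down in this log) is a SiLU-gated multiply, silu(x0) * x1 with silu(v) = v * sigmoid(v), followed by row-wise fp8 quantization via triton_quantize_fp8_row. A float32 sketch of just the activation half, mirroring the reference:

    import torch

    def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # y = silu(x0) * x1, computed in float32 as the test's ref_fn does.
        x0 = x0.to(torch.float32)
        x1 = x1.to(torch.float32)
        return x0 * torch.sigmoid(x0) * x1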
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.6118959Z 2025-05-07T20:32:17.6119084Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:17.6119310Z 2025-05-07T20:32:17.6119418Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.6119843Z self=, 2025-05-07T20:32:17.6120247Z T=128, 2025-05-07T20:32:17.6120447Z D=7168, 2025-05-07T20:32:17.6120649Z scale_ub=None, 2025-05-07T20:32:17.6120868Z contiguous=True, 2025-05-07T20:32:17.6121104Z compiled=True, 2025-05-07T20:32:17.6121321Z ) 2025-05-07T20:32:17.8643555Z self = 2025-05-07T20:32:17.8644132Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.8644421Z 2025-05-07T20:32:17.8644511Z @given( 2025-05-07T20:32:17.8644759Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8645084Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.8645406Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.8646057Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.8646411Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.8646702Z ) 2025-05-07T20:32:17.8647066Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.8647529Z def test_silu_mul_quant( 2025-05-07T20:32:17.8647780Z self, 2025-05-07T20:32:17.8647990Z T: int, 2025-05-07T20:32:17.8648200Z D: int, 2025-05-07T20:32:17.8648424Z scale_ub: Optional[float], 2025-05-07T20:32:17.8648711Z contiguous: bool, 2025-05-07T20:32:17.8648964Z compiled: bool, 2025-05-07T20:32:17.8649304Z ) -> None: 2025-05-07T20:32:17.8649535Z torch.manual_seed(2025) 2025-05-07T20:32:17.8649792Z 2025-05-07T20:32:17.8650070Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8652119Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.8654066Z 2025-05-07T20:32:17.8654191Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.8654418Z 2025-05-07T20:32:17.8682150Z FAILED 2025-05-07T20:32:17.8682346Z 2025-05-07T20:32:17.8682555Z =================================== FAILURES =================================== 2025-05-07T20:32:17.8683001Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:17.8683456Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:17.8684219Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:17.8684935Z | yield 2025-05-07T20:32:17.8685433Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:32:17.8686051Z | self._callTestMethod(testMethod) 2025-05-07T20:32:17.8686931Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:32:17.8687858Z | method() 2025-05-07T20:32:17.8688828Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:17.8689729Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8690448Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:17.8691191Z | raise the_error_hypothesis_found 2025-05-07T20:32:17.8691773Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:17.8692345Z +-+---------------- 1 ---------------- 2025-05-07T20:32:17.8692680Z | Traceback (most recent call last): 2025-05-07T20:32:17.8693490Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:17.8694391Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8696747Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
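Each numbered failure in the summary below ends with a Hypothesis reproduce hint. Applied to this test it would look roughly like the sketch here, with the version and payload copied verbatim from the first hint; the decorator pins the generated example so the failure replays deterministically:

    from hypothesis import reproduce_failure

    # Temporary, for local debugging only; remove before committing.
    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...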
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.8698818Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.8699281Z | self=, 2025-05-07T20:32:17.8699707Z | T=2048, 2025-05-07T20:32:17.8699967Z | D=5120, # or any other generated value 2025-05-07T20:32:17.8700316Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:17.8700698Z | contiguous=True, # or any other generated value 2025-05-07T20:32:17.8701322Z | compiled=False, # or any other generated value 2025-05-07T20:32:17.8701830Z | ) 2025-05-07T20:32:17.8702086Z | 2025-05-07T20:32:17.8702738Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:17.8703547Z +---------------- 2 ---------------- 2025-05-07T20:32:17.8703980Z | Traceback (most recent call last): 2025-05-07T20:32:17.8704932Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:17.8705856Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8708240Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.8710465Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.8710970Z | self=, 2025-05-07T20:32:17.8711463Z | T=128, 2025-05-07T20:32:17.8711785Z | D=7168, 2025-05-07T20:32:17.8712116Z | scale_ub=None, 2025-05-07T20:32:17.8712498Z | contiguous=True, 2025-05-07T20:32:17.8712899Z | compiled=True, 2025-05-07T20:32:17.8713266Z | ) 2025-05-07T20:32:17.8713564Z | 2025-05-07T20:32:17.8714480Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:17.8715635Z +---------------- 3 ---------------- 2025-05-07T20:32:17.8716112Z | Traceback (most recent call last): 2025-05-07T20:32:17.8717220Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:17.8718399Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8720936Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.8723015Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.8723477Z | self=, 2025-05-07T20:32:17.8723901Z | T=128, 2025-05-07T20:32:17.8724116Z | D=5120, 2025-05-07T20:32:17.8724336Z | scale_ub=1200.0, 2025-05-07T20:32:17.8724713Z | contiguous=True, 2025-05-07T20:32:17.8724967Z | compiled=True, 2025-05-07T20:32:17.8725198Z | ) 2025-05-07T20:32:17.8725391Z | 2025-05-07T20:32:17.8726018Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:17.8726639Z +---------------- 4 ---------------- 2025-05-07T20:32:17.8726953Z | Traceback (most recent call last): 2025-05-07T20:32:17.8727687Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:17.8728497Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.8729163Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:17.8729879Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8730893Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:17.8731715Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.8732336Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:17.8733087Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.8733850Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:17.8734915Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8736065Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:17.8737218Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8738359Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:17.8739406Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.8740663Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:17.8741612Z | fn() 2025-05-07T20:32:17.8742976Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:17.8743905Z | self.fn.run( 2025-05-07T20:32:17.8744643Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:17.8745419Z | kernel = self.compile( 2025-05-07T20:32:17.8746315Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:17.8747343Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.8748332Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:17.8749425Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.8750166Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.8750664Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.8751064Z | ^ 2025-05-07T20:32:17.8751727Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.8752521Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.8753155Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:17.8754020Z | self=, 2025-05-07T20:32:17.8754627Z | T=1, # or any other generated value 2025-05-07T20:32:17.8755076Z | D=5120, # or any other generated value 2025-05-07T20:32:17.8755560Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:17.8756082Z | contiguous=True, # or any other generated value 2025-05-07T20:32:17.8756599Z | compiled=True, # or any other generated value 2025-05-07T20:32:17.8757036Z | ) 2025-05-07T20:32:17.8757297Z | 2025-05-07T20:32:17.8758051Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:17.8759039Z +------------------------------------ 2025-05-07T20:32:17.8759565Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:17.8760130Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.8760725Z self=, 2025-05-07T20:32:17.8761306Z T=1, 2025-05-07T20:32:17.8761574Z D=5120, 2025-05-07T20:32:17.8761849Z scale_ub=None, 2025-05-07T20:32:17.8762162Z contiguous=True, 2025-05-07T20:32:17.8762488Z compiled=True, 2025-05-07T20:32:17.8762787Z ) 2025-05-07T20:32:17.8763254Z self = 2025-05-07T20:32:17.8763941Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.8764311Z 2025-05-07T20:32:17.8764427Z @given( 2025-05-07T20:32:17.8764769Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8765215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.8765667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.8766137Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.8766617Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.8767023Z ) 2025-05-07T20:32:17.8767512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.8768148Z def test_silu_mul_quant( 2025-05-07T20:32:17.8768508Z self, 2025-05-07T20:32:17.8768775Z T: int, 2025-05-07T20:32:17.8769064Z D: int, 2025-05-07T20:32:17.8769378Z scale_ub: Optional[float], 2025-05-07T20:32:17.8769764Z contiguous: bool, 2025-05-07T20:32:17.8770120Z compiled: bool, 2025-05-07T20:32:17.8770457Z ) -> None: 2025-05-07T20:32:17.8770764Z torch.manual_seed(2025) 2025-05-07T20:32:17.8771116Z 2025-05-07T20:32:17.8771629Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8772124Z 2025-05-07T20:32:17.8772411Z x_sign = torch.sign(x) 2025-05-07T20:32:17.8772844Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.8773286Z x = x_sign * x_clamp 2025-05-07T20:32:17.8773629Z x0 = x[:, :D] 2025-05-07T20:32:17.8773941Z x1 = x[:, D:] 2025-05-07T20:32:17.8774252Z 2025-05-07T20:32:17.8774514Z if contiguous: 2025-05-07T20:32:17.8774854Z x0 = x0.contiguous() 
2025-05-07T20:32:17.8775234Z x1 = x1.contiguous() 2025-05-07T20:32:17.8775581Z 2025-05-07T20:32:17.8775867Z if scale_ub is not None: 2025-05-07T20:32:17.8776270Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.8776745Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.8777182Z ) 2025-05-07T20:32:17.8777465Z else: 2025-05-07T20:32:17.8777775Z scale_ub_tensor = None 2025-05-07T20:32:17.8778149Z 2025-05-07T20:32:17.8778476Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8778928Z op = silu_mul_quant 2025-05-07T20:32:17.8779273Z if compiled: 2025-05-07T20:32:17.8779625Z op = torch.compile(op) 2025-05-07T20:32:17.8780102Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8780488Z 2025-05-07T20:32:17.8780769Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.8781299Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.8781723Z 2025-05-07T20:32:17.8782069Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8782553Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.8783009Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.8783460Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.8783939Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8784452Z 2025-05-07T20:32:17.8784739Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.8784984Z 2025-05-07T20:32:17.8785132Z moe/activation_test.py:126: 2025-05-07T20:32:17.8785559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8786032Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.8786500Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8787612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.8788674Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.8789443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.8790403Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.8791395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.8792423Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8793490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.8794570Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8795602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.8796521Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.8797366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.8798104Z fn() 2025-05-07T20:32:17.8798868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.8799705Z self.fn.run( 2025-05-07T20:32:17.8800380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.8801150Z kernel = self.compile( 2025-05-07T20:32:17.8801920Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.8802834Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.8803388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8803713Z 2025-05-07T20:32:17.8804000Z self = 2025-05-07T20:32:17.8805525Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.8807491Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7edee7040>} 2025-05-07T20:32:17.8809401Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.8810901Z context = 2025-05-07T20:32:17.8811324Z 2025-05-07T20:32:17.8811569Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.8812334Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.8813013Z module_map=module_map) 2025-05-07T20:32:17.8813569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.8814063Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.8814519Z E ^ 2025-05-07T20:32:17.8815187Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.8815828Z 2025-05-07T20:32:17.8816415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.8817177Z 2025-05-07T20:32:17.8817334Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.8817912Z self=, 2025-05-07T20:32:17.8818480Z T=2048, 2025-05-07T20:32:17.8818747Z D=5120, 2025-05-07T20:32:17.8819023Z scale_ub=1200.0, 2025-05-07T20:32:17.8819340Z contiguous=True, 2025-05-07T20:32:17.8819652Z compiled=False, 2025-05-07T20:32:17.8819945Z ) 2025-05-07T20:32:17.8820394Z self = 2025-05-07T20:32:17.8821227Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.8821626Z 2025-05-07T20:32:17.8821746Z @given( 2025-05-07T20:32:17.8822076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8822516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.8822951Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.8823403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.8823863Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.8824254Z ) 2025-05-07T20:32:17.8846327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.8846977Z def test_silu_mul_quant( 2025-05-07T20:32:17.8847328Z self, 2025-05-07T20:32:17.8847605Z T: int, 2025-05-07T20:32:17.8847887Z D: int, 2025-05-07T20:32:17.8848183Z scale_ub: Optional[float], 2025-05-07T20:32:17.8848558Z contiguous: bool, 2025-05-07T20:32:17.8848879Z compiled: bool, 2025-05-07T20:32:17.8849472Z ) -> None: 2025-05-07T20:32:17.8849785Z torch.manual_seed(2025) 2025-05-07T20:32:17.8850128Z 2025-05-07T20:32:17.8850507Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8850969Z 2025-05-07T20:32:17.8851253Z x_sign = torch.sign(x) 2025-05-07T20:32:17.8851679Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.8852117Z x = x_sign * x_clamp 2025-05-07T20:32:17.8852473Z x0 = x[:, :D] 
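The sign/clamp preamble in the listing above bounds every input magnitude to [0.01, 2.0] while preserving signs, so no row of x is all-zero and the row-wise quantization scale never degenerates. A standalone rendering of that construction (tiny shapes and CPU tensors here, purely for portability):

import torch

T, D = 4, 8  # stand-in sizes; the test samples T and D from much larger sets
x = torch.randn([T, 2 * D], dtype=torch.bfloat16)
x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
x0, x1 = x[:, :D], x[:, D:]  # mirrors the x[:, :D] / x[:, D:] split above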
2025-05-07T20:32:17.8852794Z x1 = x[:, D:] 2025-05-07T20:32:17.8853087Z 2025-05-07T20:32:17.8853353Z if contiguous: 2025-05-07T20:32:17.8853684Z x0 = x0.contiguous() 2025-05-07T20:32:17.8854034Z x1 = x1.contiguous() 2025-05-07T20:32:17.8854381Z 2025-05-07T20:32:17.8854660Z if scale_ub is not None: 2025-05-07T20:32:17.8855046Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.8855521Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.8855961Z ) 2025-05-07T20:32:17.8856214Z else: 2025-05-07T20:32:17.8856514Z scale_ub_tensor = None 2025-05-07T20:32:17.8856890Z 2025-05-07T20:32:17.8857229Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8857667Z op = silu_mul_quant 2025-05-07T20:32:17.8858138Z if compiled: 2025-05-07T20:32:17.8858499Z op = torch.compile(op) 2025-05-07T20:32:17.8858922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8859322Z 2025-05-07T20:32:17.8859601Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.8859839Z 2025-05-07T20:32:17.8859980Z moe/activation_test.py:117: 2025-05-07T20:32:17.8860399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8860883Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.8861407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8862358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.8863420Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.8864175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.8865104Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.8866021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.8866762Z kernel = self.compile( 2025-05-07T20:32:17.8867522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.8868393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.8868962Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8869284Z 2025-05-07T20:32:17.8869582Z self = 2025-05-07T20:32:17.8871052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.8873010Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7cbe6e5e0>} 2025-05-07T20:32:17.8874884Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.8876220Z context = 2025-05-07T20:32:17.8876607Z 2025-05-07T20:32:17.8876835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.8877661Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.8878340Z module_map=module_map) 2025-05-07T20:32:17.8878858Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.8879356Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.8879747Z E ^ 2025-05-07T20:32:17.8880384Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.8880849Z 2025-05-07T20:32:17.8881289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.8881813Z 2025-05-07T20:32:17.8881923Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.8882353Z self=, 2025-05-07T20:32:17.8882771Z T=2048, 2025-05-07T20:32:17.8882974Z D=5120, 2025-05-07T20:32:17.8883191Z scale_ub=1200.0, 2025-05-07T20:32:17.8883428Z contiguous=True, 2025-05-07T20:32:17.8883660Z compiled=True, 2025-05-07T20:32:17.8883882Z ) 2025-05-07T20:32:17.8884216Z self = 2025-05-07T20:32:17.8884727Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.8885072Z 2025-05-07T20:32:17.8885155Z @given( 2025-05-07T20:32:17.8885398Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8885725Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.8886037Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.8886370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.8886707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.8887005Z ) 2025-05-07T20:32:17.8887367Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.8887823Z def test_silu_mul_quant( 2025-05-07T20:32:17.8888136Z self, 2025-05-07T20:32:17.8888337Z T: int, 2025-05-07T20:32:17.8888544Z D: int, 2025-05-07T20:32:17.8888774Z scale_ub: Optional[float], 2025-05-07T20:32:17.8889049Z contiguous: bool, 2025-05-07T20:32:17.8889298Z compiled: bool, 2025-05-07T20:32:17.8889533Z ) -> None: 2025-05-07T20:32:17.8889753Z torch.manual_seed(2025) 2025-05-07T20:32:17.8890004Z 2025-05-07T20:32:17.8890285Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8890630Z 2025-05-07T20:32:17.8890832Z x_sign = torch.sign(x) 2025-05-07T20:32:17.8891135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.8891448Z x = x_sign * x_clamp 2025-05-07T20:32:17.8891701Z x0 = x[:, :D] 2025-05-07T20:32:17.8891929Z x1 = x[:, D:] 2025-05-07T20:32:17.8892140Z 2025-05-07T20:32:17.8892339Z if contiguous: 2025-05-07T20:32:17.8892594Z x0 = x0.contiguous() 2025-05-07T20:32:17.8892873Z x1 = x1.contiguous() 2025-05-07T20:32:17.8893123Z 2025-05-07T20:32:17.8893328Z if scale_ub is not None: 2025-05-07T20:32:17.8893612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.8893956Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.8894282Z ) 2025-05-07T20:32:17.8894489Z else: 2025-05-07T20:32:17.8894707Z scale_ub_tensor = None 2025-05-07T20:32:17.8894971Z 2025-05-07T20:32:17.8895215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8895539Z op = silu_mul_quant 2025-05-07T20:32:17.8895809Z if compiled: 2025-05-07T20:32:17.8896053Z op = torch.compile(op) 2025-05-07T20:32:17.8896356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8896638Z 2025-05-07T20:32:17.8896842Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.8897208Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.8897517Z 2025-05-07T20:32:17.8897755Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8898099Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.8898400Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.8898719Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.8899083Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8899400Z 2025-05-07T20:32:17.8899611Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.8899806Z 2025-05-07T20:32:17.8899908Z moe/activation_test.py:126: 2025-05-07T20:32:17.8900212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8900558Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.8900890Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8901787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.8902549Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.8903156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.8903840Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.8904604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.8905329Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8906084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.8906828Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8907565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.8908258Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.8908871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.8909390Z fn() 2025-05-07T20:32:17.8909910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.8910499Z self.fn.run( 2025-05-07T20:32:17.8910976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.8911519Z kernel = self.compile( 2025-05-07T20:32:17.8912077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.8912740Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.8913187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8913430Z 2025-05-07T20:32:17.8913640Z self = 2025-05-07T20:32:17.8914729Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.8916129Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd7eca155e0>} 2025-05-07T20:32:17.8917472Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.8918495Z context = 2025-05-07T20:32:17.8918793Z 2025-05-07T20:32:17.8919043Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.8919585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.8920051Z module_map=module_map) 2025-05-07T20:32:17.8920435Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.8920806Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.8921087Z E ^ 2025-05-07T20:32:17.8921552Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.8922010Z 2025-05-07T20:32:17.8922438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.8922953Z 2025-05-07T20:32:17.8923064Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.8923481Z self=, 2025-05-07T20:32:17.8923894Z T=16384, 2025-05-07T20:32:17.8924095Z D=7168, 2025-05-07T20:32:17.8924295Z scale_ub=1200.0, 2025-05-07T20:32:17.8924520Z contiguous=False, 2025-05-07T20:32:17.8924766Z compiled=False, 2025-05-07T20:32:17.8924980Z ) 2025-05-07T20:32:17.8925298Z self = 2025-05-07T20:32:17.8925886Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.8926170Z 2025-05-07T20:32:17.8926259Z @given( 2025-05-07T20:32:17.8926490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8926819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.8927133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.8927469Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.8927801Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.8928096Z ) 2025-05-07T20:32:17.8928462Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.8928963Z def test_silu_mul_quant( 2025-05-07T20:32:17.8929217Z self, 2025-05-07T20:32:17.8929418Z T: int, 2025-05-07T20:32:17.8929615Z D: int, 2025-05-07T20:32:17.8929839Z scale_ub: Optional[float], 2025-05-07T20:32:17.8930121Z contiguous: bool, 2025-05-07T20:32:17.8930361Z compiled: bool, 2025-05-07T20:32:17.8930592Z ) -> None: 2025-05-07T20:32:17.8930830Z torch.manual_seed(2025) 2025-05-07T20:32:17.8931076Z 2025-05-07T20:32:17.8931355Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8931709Z 2025-05-07T20:32:17.8931900Z x_sign = torch.sign(x) 2025-05-07T20:32:17.8932201Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.8932524Z x = x_sign * x_clamp 2025-05-07T20:32:17.8932770Z x0 = x[:, :D] 2025-05-07T20:32:17.8932999Z x1 = x[:, D:] 2025-05-07T20:32:17.8933235Z 2025-05-07T20:32:17.8933447Z if contiguous: 2025-05-07T20:32:17.8933691Z x0 = x0.contiguous() 2025-05-07T20:32:17.8933957Z x1 = x1.contiguous() 2025-05-07T20:32:17.8934198Z 2025-05-07T20:32:17.8934394Z if scale_ub is not None: 2025-05-07T20:32:17.8934673Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.8935014Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.8935332Z ) 2025-05-07T20:32:17.8935525Z else: 2025-05-07T20:32:17.8935742Z scale_ub_tensor = None 2025-05-07T20:32:17.8935993Z 2025-05-07T20:32:17.8936226Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8936549Z op = silu_mul_quant 2025-05-07T20:32:17.8936801Z if compiled: 
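ref_fn in the listings above dequantizes as y ≈ y_fp8.float() * y_scale[:, None], so triton_quantize_fp8_row must return a per-row scale of roughly row_max / FP8_MAX and store y / scale in fp8. A pure-PyTorch sketch of that contract follows; FP8_MAX = 448 (the finite max of float8_e4m3fn) and the scale_ub handling are assumptions of this sketch rather than FBGEMM's kernel, and torch.float8_e4m3fn needs PyTorch 2.1+.

from typing import Optional, Tuple
import torch

FP8_MAX = 448.0  # finite max of float8_e4m3fn; assumed, not read from FBGEMM

def quantize_fp8_row_ref(y: torch.Tensor,
                         scale_ub: Optional[torch.Tensor] = None,
                         ) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=1).float()           # per-row absolute maximum
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap rows, as scale_ub suggests
    scale = row_max.clamp(min=1e-12) / FP8_MAX      # per-row dequant multiplier
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale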
2025-05-07T20:32:17.8937055Z op = torch.compile(op) 2025-05-07T20:32:17.8937359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8937638Z 2025-05-07T20:32:17.8937924Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.8938095Z 2025-05-07T20:32:17.8938203Z moe/activation_test.py:117: 2025-05-07T20:32:17.8938499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8938843Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.8939140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8939850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.8940833Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.8941442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.8942129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.8942794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.8943345Z kernel = self.compile( 2025-05-07T20:32:17.8943894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.8944558Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.8944960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8945297Z 2025-05-07T20:32:17.8945509Z self = 2025-05-07T20:32:17.8946598Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.8947972Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ec7e61f0>} 2025-05-07T20:32:17.8949376Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.8950399Z context = 2025-05-07T20:32:17.8950699Z 2025-05-07T20:32:17.8950870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.8951400Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.8951870Z module_map=module_map) 2025-05-07T20:32:17.8952252Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.8952619Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.8952913Z E ^ 2025-05-07T20:32:17.8953400Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.8953860Z 2025-05-07T20:32:17.8954282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.8954804Z 2025-05-07T20:32:17.8954917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.8955331Z self=, 2025-05-07T20:32:17.8955749Z T=1, 2025-05-07T20:32:17.8955940Z D=7168, 2025-05-07T20:32:17.8956144Z scale_ub=None, 2025-05-07T20:32:17.8956359Z contiguous=True, 2025-05-07T20:32:17.8956596Z compiled=True, 2025-05-07T20:32:17.8956808Z ) 2025-05-07T20:32:17.8957129Z self = 2025-05-07T20:32:17.8957623Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.8957884Z 2025-05-07T20:32:17.8957971Z @given( 2025-05-07T20:32:17.8958203Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8958524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.8959047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.8959385Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.8959723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.8960016Z ) 2025-05-07T20:32:17.8960373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.8960821Z def test_silu_mul_quant( 2025-05-07T20:32:17.8961071Z self, 2025-05-07T20:32:17.8961273Z T: int, 2025-05-07T20:32:17.8961472Z D: int, 2025-05-07T20:32:17.8961698Z scale_ub: Optional[float], 2025-05-07T20:32:17.8961984Z contiguous: bool, 2025-05-07T20:32:17.8962225Z compiled: bool, 2025-05-07T20:32:17.8962455Z ) -> None: 2025-05-07T20:32:17.8962678Z torch.manual_seed(2025) 2025-05-07T20:32:17.8962923Z 2025-05-07T20:32:17.8963202Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.8963555Z 2025-05-07T20:32:17.8963755Z x_sign = torch.sign(x) 2025-05-07T20:32:17.8964054Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.8964375Z x = x_sign * x_clamp 2025-05-07T20:32:17.8964618Z x0 = x[:, :D] 2025-05-07T20:32:17.8964842Z x1 = x[:, D:] 2025-05-07T20:32:17.8965106Z 2025-05-07T20:32:17.8965298Z if contiguous: 2025-05-07T20:32:17.8965532Z x0 = x0.contiguous() 2025-05-07T20:32:17.8965800Z x1 = x1.contiguous() 2025-05-07T20:32:17.8966047Z 2025-05-07T20:32:17.8966242Z if scale_ub is not None: 2025-05-07T20:32:17.8966525Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.8966868Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.8967180Z ) 2025-05-07T20:32:17.8967380Z else: 2025-05-07T20:32:17.8967599Z scale_ub_tensor = None 2025-05-07T20:32:17.8967852Z 2025-05-07T20:32:17.8968097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8968472Z op = silu_mul_quant 2025-05-07T20:32:17.8968728Z if compiled: 2025-05-07T20:32:17.8968985Z op = torch.compile(op) 2025-05-07T20:32:17.8969290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.8969567Z 2025-05-07T20:32:17.8969770Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.8970067Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.8970366Z 2025-05-07T20:32:17.8970603Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.8970947Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.8971252Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.8971575Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.8971944Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8972264Z 2025-05-07T20:32:17.8972468Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:17.8972683Z 2025-05-07T20:32:17.8972797Z moe/activation_test.py:126: 2025-05-07T20:32:17.8973139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8973485Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.8973818Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.8974614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.8975372Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.8975920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.8976609Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.8977304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.8978124Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8978989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.8979760Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.8980498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.8981209Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.8981820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.8982347Z fn() 2025-05-07T20:32:17.8982861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.8983488Z self.fn.run( 2025-05-07T20:32:17.8983965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.8984513Z kernel = self.compile( 2025-05-07T20:32:17.8985067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.8985725Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.8986223Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.8986460Z 2025-05-07T20:32:17.8986681Z self = 2025-05-07T20:32:17.8987761Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.8989124Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd7ec7e6790>} 2025-05-07T20:32:17.8990520Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.8991545Z context = 2025-05-07T20:32:17.8991842Z 2025-05-07T20:32:17.8992019Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.8992544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.8993015Z module_map=module_map) 2025-05-07T20:32:17.8993391Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.8993757Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.8994026Z E ^ 2025-05-07T20:32:17.8994499Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.8994958Z 2025-05-07T20:32:17.8995380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.8995891Z 2025-05-07T20:32:17.8995997Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.8996427Z self=, 2025-05-07T20:32:17.8996837Z T=4096, 2025-05-07T20:32:17.8997034Z D=5120, 2025-05-07T20:32:17.8997226Z scale_ub=None, 2025-05-07T20:32:17.8997450Z contiguous=False, 2025-05-07T20:32:17.8997689Z compiled=False, 2025-05-07T20:32:17.8997896Z ) 2025-05-07T20:32:17.8998228Z self = 2025-05-07T20:32:17.8998730Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.8999005Z 2025-05-07T20:32:17.8999086Z @given( 2025-05-07T20:32:17.8999439Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.8999765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9000074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9000409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9000748Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9001047Z ) 2025-05-07T20:32:17.9001397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9001845Z def test_silu_mul_quant( 2025-05-07T20:32:17.9002094Z self, 2025-05-07T20:32:17.9002288Z T: int, 2025-05-07T20:32:17.9002496Z D: int, 2025-05-07T20:32:17.9002721Z scale_ub: Optional[float], 2025-05-07T20:32:17.9003043Z contiguous: bool, 2025-05-07T20:32:17.9003293Z compiled: bool, 2025-05-07T20:32:17.9003522Z ) -> None: 2025-05-07T20:32:17.9003738Z torch.manual_seed(2025) 2025-05-07T20:32:17.9003987Z 2025-05-07T20:32:17.9004275Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9004618Z 2025-05-07T20:32:17.9004819Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9005123Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9005436Z x = x_sign * x_clamp 2025-05-07T20:32:17.9005740Z x0 = x[:, :D] 2025-05-07T20:32:17.9005970Z x1 = x[:, D:] 2025-05-07T20:32:17.9006186Z 2025-05-07T20:32:17.9014489Z if contiguous: 2025-05-07T20:32:17.9014770Z x0 = x0.contiguous() 2025-05-07T20:32:17.9015046Z x1 = x1.contiguous() 2025-05-07T20:32:17.9015301Z 2025-05-07T20:32:17.9015502Z if scale_ub is not None: 2025-05-07T20:32:17.9015789Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9016136Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9016451Z ) 2025-05-07T20:32:17.9016659Z else: 2025-05-07T20:32:17.9016893Z scale_ub_tensor = None 2025-05-07T20:32:17.9017236Z 2025-05-07T20:32:17.9017475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9017808Z op = silu_mul_quant 2025-05-07T20:32:17.9018071Z if compiled: 
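The frames at triton/runtime/autotuner.py:186 in the tracebacks above show why the failure surfaces during benchmarking: the autotuner times every pruned config, and each timing call JIT-compiles the kernel, so an unsupported dtype raises CompilationError inside the timing dict-comprehension rather than at import time. A simplified, self-contained analogue of that loop (all names here are illustrative, not Triton's API):

import time
from typing import Callable, Dict, Iterable, TypeVar

Cfg = TypeVar("Cfg")

def _bench(kernel_call: Callable[[], None]) -> float:
    start = time.perf_counter()
    kernel_call()                    # first launch JIT-compiles; may raise
    return time.perf_counter() - start

def autotune(make_call: Callable[[Cfg], Callable[[], None]],
             configs: Iterable[Cfg]) -> Cfg:
    # Mirrors autotuner.py:186: every config is compiled and timed eagerly,
    # so a single CompilationError aborts the whole kernel launch.
    timings: Dict[Cfg, float] = {cfg: _bench(make_call(cfg)) for cfg in configs}
    return min(timings, key=timings.get)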
2025-05-07T20:32:17.9018326Z op = torch.compile(op) 2025-05-07T20:32:17.9018642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9018934Z 2025-05-07T20:32:17.9019128Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9019308Z 2025-05-07T20:32:17.9019415Z moe/activation_test.py:117: 2025-05-07T20:32:17.9019728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9020065Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9020362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9021159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9021897Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9022449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9023148Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9023819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9024358Z kernel = self.compile( 2025-05-07T20:32:17.9024915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9025578Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9025994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9026227Z 2025-05-07T20:32:17.9026441Z self = 2025-05-07T20:32:17.9027618Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9029022Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ec3e9550>} 2025-05-07T20:32:17.9030384Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9031406Z context = 2025-05-07T20:32:17.9031699Z 2025-05-07T20:32:17.9031869Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9032401Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9032924Z module_map=module_map) 2025-05-07T20:32:17.9033317Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9033684Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9033959Z E ^ 2025-05-07T20:32:17.9034431Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9034927Z 2025-05-07T20:32:17.9035345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9035876Z 2025-05-07T20:32:17.9035985Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9036420Z self=, 2025-05-07T20:32:17.9036833Z T=4096, 2025-05-07T20:32:17.9037026Z D=7168, 2025-05-07T20:32:17.9037232Z scale_ub=None, 2025-05-07T20:32:17.9037460Z contiguous=False, 2025-05-07T20:32:17.9037693Z compiled=False, 2025-05-07T20:32:17.9037967Z ) 2025-05-07T20:32:17.9038299Z self = 2025-05-07T20:32:17.9038801Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.9039084Z 2025-05-07T20:32:17.9039170Z @given( 2025-05-07T20:32:17.9039415Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9039734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9040060Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9040726Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9041066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9041369Z ) 2025-05-07T20:32:17.9041726Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9042186Z def test_silu_mul_quant( 2025-05-07T20:32:17.9042432Z self, 2025-05-07T20:32:17.9042638Z T: int, 2025-05-07T20:32:17.9042850Z D: int, 2025-05-07T20:32:17.9043078Z scale_ub: Optional[float], 2025-05-07T20:32:17.9043363Z contiguous: bool, 2025-05-07T20:32:17.9043610Z compiled: bool, 2025-05-07T20:32:17.9043839Z ) -> None: 2025-05-07T20:32:17.9044061Z torch.manual_seed(2025) 2025-05-07T20:32:17.9044314Z 2025-05-07T20:32:17.9044599Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9044950Z 2025-05-07T20:32:17.9045155Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9045448Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9045770Z x = x_sign * x_clamp 2025-05-07T20:32:17.9046018Z x0 = x[:, :D] 2025-05-07T20:32:17.9046238Z x1 = x[:, D:] 2025-05-07T20:32:17.9046454Z 2025-05-07T20:32:17.9046650Z if contiguous: 2025-05-07T20:32:17.9046890Z x0 = x0.contiguous() 2025-05-07T20:32:17.9047157Z x1 = x1.contiguous() 2025-05-07T20:32:17.9047413Z 2025-05-07T20:32:17.9047799Z if scale_ub is not None: 2025-05-07T20:32:17.9048081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9048434Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9048755Z ) 2025-05-07T20:32:17.9048952Z else: 2025-05-07T20:32:17.9049175Z scale_ub_tensor = None 2025-05-07T20:32:17.9049441Z 2025-05-07T20:32:17.9049677Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9050008Z op = silu_mul_quant 2025-05-07T20:32:17.9050271Z if compiled: 2025-05-07T20:32:17.9050525Z op = torch.compile(op) 2025-05-07T20:32:17.9050831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9051117Z 2025-05-07T20:32:17.9051310Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9051491Z 2025-05-07T20:32:17.9051593Z moe/activation_test.py:117: 2025-05-07T20:32:17.9051904Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9052258Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9052548Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9053254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9054025Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9054561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9055259Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9055927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9056466Z kernel = self.compile( 2025-05-07T20:32:17.9057015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9057685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9058164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9058397Z 2025-05-07T20:32:17.9058614Z self = 2025-05-07T20:32:17.9059699Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9061059Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ec29b5e0>} 2025-05-07T20:32:17.9062470Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9063551Z context = 2025-05-07T20:32:17.9063843Z 2025-05-07T20:32:17.9064022Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9064551Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9065024Z module_map=module_map) 2025-05-07T20:32:17.9065397Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9065757Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9066018Z E ^ 2025-05-07T20:32:17.9066486Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9066934Z 2025-05-07T20:32:17.9067354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9067865Z 2025-05-07T20:32:17.9067977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9068483Z self=, 2025-05-07T20:32:17.9068896Z T=128, 2025-05-07T20:32:17.9069093Z D=7168, 2025-05-07T20:32:17.9069285Z scale_ub=None, 2025-05-07T20:32:17.9069510Z contiguous=False, 2025-05-07T20:32:17.9069746Z compiled=True, 2025-05-07T20:32:17.9069951Z ) 2025-05-07T20:32:17.9070277Z self = 2025-05-07T20:32:17.9070774Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.9071042Z 2025-05-07T20:32:17.9071127Z @given( 2025-05-07T20:32:17.9071358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9071677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9071993Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9072324Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9072661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9072968Z ) 2025-05-07T20:32:17.9073324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9073777Z def test_silu_mul_quant( 2025-05-07T20:32:17.9074029Z self, 2025-05-07T20:32:17.9074225Z T: int, 2025-05-07T20:32:17.9074478Z D: int, 2025-05-07T20:32:17.9074705Z scale_ub: Optional[float], 2025-05-07T20:32:17.9074979Z contiguous: bool, 2025-05-07T20:32:17.9075226Z compiled: bool, 2025-05-07T20:32:17.9075455Z ) -> None: 2025-05-07T20:32:17.9075679Z torch.manual_seed(2025) 2025-05-07T20:32:17.9075921Z 2025-05-07T20:32:17.9076198Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9076549Z 2025-05-07T20:32:17.9076743Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9077045Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9077365Z x = x_sign * x_clamp 2025-05-07T20:32:17.9077658Z x0 = x[:, :D] 2025-05-07T20:32:17.9077881Z x1 = x[:, D:] 2025-05-07T20:32:17.9078096Z 2025-05-07T20:32:17.9078282Z if contiguous: 2025-05-07T20:32:17.9078525Z x0 = x0.contiguous() 2025-05-07T20:32:17.9078793Z x1 = x1.contiguous() 2025-05-07T20:32:17.9079037Z 2025-05-07T20:32:17.9079241Z if scale_ub is not None: 2025-05-07T20:32:17.9079523Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9079860Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9080179Z ) 2025-05-07T20:32:17.9080380Z else: 2025-05-07T20:32:17.9080599Z scale_ub_tensor = None 2025-05-07T20:32:17.9080853Z 2025-05-07T20:32:17.9081093Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9081417Z op = silu_mul_quant 2025-05-07T20:32:17.9081672Z if compiled: 2025-05-07T20:32:17.9081928Z op = torch.compile(op) 2025-05-07T20:32:17.9082245Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9082523Z 2025-05-07T20:32:17.9082721Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.9083014Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.9083306Z 2025-05-07T20:32:17.9083550Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9083898Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.9084192Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.9084513Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.9084885Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.9085199Z 2025-05-07T20:32:17.9085404Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:17.9085613Z 2025-05-07T20:32:17.9085715Z moe/activation_test.py:126: 2025-05-07T20:32:17.9086019Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9086476Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.9086819Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.9087617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.9088368Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.9088926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9089612Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9090308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.9091027Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.9091783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.9092542Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.9093268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.9093911Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.9094564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.9095089Z fn() 2025-05-07T20:32:17.9095594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.9096180Z self.fn.run( 2025-05-07T20:32:17.9096658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9097183Z kernel = self.compile( 2025-05-07T20:32:17.9097748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9098453Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9098857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9099094Z 2025-05-07T20:32:17.9099309Z self = 2025-05-07T20:32:17.9100400Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9101868Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd7ecc73af0>} 2025-05-07T20:32:17.9103209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9104246Z context = 2025-05-07T20:32:17.9104538Z 2025-05-07T20:32:17.9104708Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9105241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9105709Z module_map=module_map) 2025-05-07T20:32:17.9106078Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9106445Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.9106722Z E ^ 2025-05-07T20:32:17.9107187Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9107637Z 2025-05-07T20:32:17.9108053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9108665Z 2025-05-07T20:32:17.9108774Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9109195Z self=, 2025-05-07T20:32:17.9109602Z T=128, 2025-05-07T20:32:17.9109791Z D=7168, 2025-05-07T20:32:17.9109994Z scale_ub=None, 2025-05-07T20:32:17.9111196Z contiguous=False, 2025-05-07T20:32:17.9111431Z compiled=False, 2025-05-07T20:32:17.9111644Z ) 2025-05-07T20:32:17.9111972Z self = 2025-05-07T20:32:17.9112468Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.9112747Z 2025-05-07T20:32:17.9112829Z @given( 2025-05-07T20:32:17.9113069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9113390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9113707Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9114053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9114395Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9114687Z ) 2025-05-07T20:32:17.9115044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9115493Z def test_silu_mul_quant( 2025-05-07T20:32:17.9115789Z self, 2025-05-07T20:32:17.9115992Z T: int, 2025-05-07T20:32:17.9116201Z D: int, 2025-05-07T20:32:17.9116422Z scale_ub: Optional[float], 2025-05-07T20:32:17.9116703Z contiguous: bool, 2025-05-07T20:32:17.9116952Z compiled: bool, 2025-05-07T20:32:17.9117180Z ) -> None: 2025-05-07T20:32:17.9117411Z torch.manual_seed(2025) 2025-05-07T20:32:17.9117666Z 2025-05-07T20:32:17.9117947Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9118299Z 2025-05-07T20:32:17.9118502Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9118802Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9119172Z x = x_sign * x_clamp 2025-05-07T20:32:17.9119420Z x0 = x[:, :D] 2025-05-07T20:32:17.9119647Z x1 = x[:, D:] 2025-05-07T20:32:17.9119854Z 2025-05-07T20:32:17.9120045Z if contiguous: 2025-05-07T20:32:17.9120281Z x0 = x0.contiguous() 2025-05-07T20:32:17.9120544Z x1 = x1.contiguous() 2025-05-07T20:32:17.9120792Z 2025-05-07T20:32:17.9120991Z if scale_ub is not None: 2025-05-07T20:32:17.9121266Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9121612Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9121931Z ) 2025-05-07T20:32:17.9122124Z else: 2025-05-07T20:32:17.9122342Z scale_ub_tensor = None 2025-05-07T20:32:17.9122602Z 2025-05-07T20:32:17.9122835Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9123158Z op = silu_mul_quant 2025-05-07T20:32:17.9123420Z if compiled: 
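Hypothesis printed a replay blob with the first failure above; pinning it on the test re-runs exactly that falsifying example instead of searching again. The blob is only valid under the same Hypothesis version ('6.131.14'), and the strategy below is trimmed for illustration:

from hypothesis import given, reproduce_failure, strategies as st

@reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')  # blob from the log above
@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
def test_replay(T: int) -> None:  # hypothetical trimmed signature
    ...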
2025-05-07T20:32:17.9123678Z op = torch.compile(op) 2025-05-07T20:32:17.9123982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9124265Z 2025-05-07T20:32:17.9124469Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9124637Z 2025-05-07T20:32:17.9124740Z moe/activation_test.py:117: 2025-05-07T20:32:17.9125047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9125389Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9125674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9126371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9127074Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9127620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9128384Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9129065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9129695Z kernel = self.compile( 2025-05-07T20:32:17.9130470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9131253Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9131659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9131893Z 2025-05-07T20:32:17.9132111Z self = 2025-05-07T20:32:17.9133205Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9134651Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ebf6f430>} 2025-05-07T20:32:17.9136000Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9137102Z context = 2025-05-07T20:32:17.9137397Z 2025-05-07T20:32:17.9137573Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9138103Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9138575Z module_map=module_map) 2025-05-07T20:32:17.9138952Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9139306Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9139628Z E ^ 2025-05-07T20:32:17.9140359Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9140939Z 2025-05-07T20:32:17.9141549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9142080Z 2025-05-07T20:32:17.9142187Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9142607Z self=, 2025-05-07T20:32:17.9143013Z T=4096, 2025-05-07T20:32:17.9143202Z D=5120, 2025-05-07T20:32:17.9143404Z scale_ub=1200.0, 2025-05-07T20:32:17.9143636Z contiguous=True, 2025-05-07T20:32:17.9143865Z compiled=False, 2025-05-07T20:32:17.9144078Z ) 2025-05-07T20:32:17.9144402Z self = 2025-05-07T20:32:17.9144902Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9145197Z 2025-05-07T20:32:17.9145278Z @given( 2025-05-07T20:32:17.9145517Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9145837Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9146146Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9146494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9146831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9147119Z ) 2025-05-07T20:32:17.9147482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9147931Z def test_silu_mul_quant( 2025-05-07T20:32:17.9148179Z self, 2025-05-07T20:32:17.9148385Z T: int, 2025-05-07T20:32:17.9148594Z D: int, 2025-05-07T20:32:17.9148816Z scale_ub: Optional[float], 2025-05-07T20:32:17.9149100Z contiguous: bool, 2025-05-07T20:32:17.9149351Z compiled: bool, 2025-05-07T20:32:17.9149583Z ) -> None: 2025-05-07T20:32:17.9150005Z torch.manual_seed(2025) 2025-05-07T20:32:17.9150262Z 2025-05-07T20:32:17.9150545Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9150890Z 2025-05-07T20:32:17.9151094Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9151396Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9151714Z x = x_sign * x_clamp 2025-05-07T20:32:17.9151967Z x0 = x[:, :D] 2025-05-07T20:32:17.9152194Z x1 = x[:, D:] 2025-05-07T20:32:17.9152405Z 2025-05-07T20:32:17.9152599Z if contiguous: 2025-05-07T20:32:17.9152841Z x0 = x0.contiguous() 2025-05-07T20:32:17.9153106Z x1 = x1.contiguous() 2025-05-07T20:32:17.9153358Z 2025-05-07T20:32:17.9153562Z if scale_ub is not None: 2025-05-07T20:32:17.9153840Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9154192Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9154513Z ) 2025-05-07T20:32:17.9154722Z else: 2025-05-07T20:32:17.9154938Z scale_ub_tensor = None 2025-05-07T20:32:17.9155201Z 2025-05-07T20:32:17.9155439Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9155761Z op = silu_mul_quant 2025-05-07T20:32:17.9156095Z if compiled: 2025-05-07T20:32:17.9156351Z op = torch.compile(op) 2025-05-07T20:32:17.9156651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9156936Z 2025-05-07T20:32:17.9157136Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9157304Z 2025-05-07T20:32:17.9157406Z moe/activation_test.py:117: 2025-05-07T20:32:17.9157709Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9158049Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9158342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9159038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9159842Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9160387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9161066Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9161739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9162393Z kernel = self.compile( 2025-05-07T20:32:17.9162979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9163653Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9164060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9164295Z 2025-05-07T20:32:17.9164522Z self = 2025-05-07T20:32:17.9165618Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9180262Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ebcb6430>} 2025-05-07T20:32:17.9181773Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9182816Z context = 2025-05-07T20:32:17.9183115Z 2025-05-07T20:32:17.9183300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9184657Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9185162Z module_map=module_map) 2025-05-07T20:32:17.9185546Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9185906Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9186178Z E ^ 2025-05-07T20:32:17.9186659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9187114Z 2025-05-07T20:32:17.9187539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9188064Z 2025-05-07T20:32:17.9188170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9188596Z self=, 2025-05-07T20:32:17.9189005Z T=1, 2025-05-07T20:32:17.9189190Z D=5120, 2025-05-07T20:32:17.9189394Z scale_ub=None, 2025-05-07T20:32:17.9189635Z contiguous=True, 2025-05-07T20:32:17.9189863Z compiled=True, 2025-05-07T20:32:17.9190080Z ) 2025-05-07T20:32:17.9190409Z self = 2025-05-07T20:32:17.9190899Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.9191239Z 2025-05-07T20:32:17.9191320Z @given( 2025-05-07T20:32:17.9191561Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9191889Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9192200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9192544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9192884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9193178Z ) 2025-05-07T20:32:17.9193555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9194117Z def test_silu_mul_quant( 2025-05-07T20:32:17.9194427Z self, 2025-05-07T20:32:17.9194636Z T: int, 2025-05-07T20:32:17.9194842Z D: int, 2025-05-07T20:32:17.9195063Z scale_ub: Optional[float], 2025-05-07T20:32:17.9195172Z contiguous: bool, 2025-05-07T20:32:17.9195261Z compiled: bool, 2025-05-07T20:32:17.9195353Z ) -> None: 2025-05-07T20:32:17.9195456Z torch.manual_seed(2025) 2025-05-07T20:32:17.9195531Z 2025-05-07T20:32:17.9195711Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9195787Z 2025-05-07T20:32:17.9195883Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9196018Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9196111Z x = x_sign * x_clamp 2025-05-07T20:32:17.9196195Z x0 = x[:, :D] 2025-05-07T20:32:17.9196284Z x1 = x[:, D:] 2025-05-07T20:32:17.9196361Z 2025-05-07T20:32:17.9196446Z if contiguous: 2025-05-07T20:32:17.9196551Z x0 = x0.contiguous() 2025-05-07T20:32:17.9196655Z x1 = x1.contiguous() 2025-05-07T20:32:17.9196730Z 2025-05-07T20:32:17.9196833Z if scale_ub is not None: 2025-05-07T20:32:17.9196942Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9197089Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9197171Z ) 2025-05-07T20:32:17.9197252Z else: 2025-05-07T20:32:17.9197355Z scale_ub_tensor = None 2025-05-07T20:32:17.9197429Z 2025-05-07T20:32:17.9197563Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9197663Z op = silu_mul_quant 2025-05-07T20:32:17.9197751Z if compiled: 2025-05-07T20:32:17.9197855Z op = torch.compile(op) 2025-05-07T20:32:17.9197972Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9198047Z 2025-05-07T20:32:17.9198142Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.9198273Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.9198443Z 2025-05-07T20:32:17.9198594Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9198700Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.9198803Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.9198937Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.9199083Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.9199162Z 2025-05-07T20:32:17.9199273Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:17.9199277Z 2025-05-07T20:32:17.9199380Z moe/activation_test.py:126: 2025-05-07T20:32:17.9199519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9199627Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.9199766Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.9200351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.9200459Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.9200827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9201068Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9201487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.9201758Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.9202162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.9202421Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.9202815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.9203032Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.9203385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.9203468Z fn() 2025-05-07T20:32:17.9203867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.9203967Z self.fn.run( 2025-05-07T20:32:17.9204324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9204461Z kernel = self.compile( 2025-05-07T20:32:17.9205002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9205199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9205339Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9205356Z 2025-05-07T20:32:17.9205567Z self = 2025-05-07T20:32:17.9206352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9206874Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd7ebcb6940>} 2025-05-07T20:32:17.9207617Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9207819Z context = 2025-05-07T20:32:17.9208095Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9208372Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9208491Z module_map=module_map) 2025-05-07T20:32:17.9208655Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9208770Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.9208851Z E ^ 2025-05-07T20:32:17.9209209Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9209635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then retried the test with the parameter combinations below. Each attempt fails with the identical fp8e4nv CompilationError; only the failing call site and the kernel being compiled differ, and the per-example source listing and traceback match the one shown above:

2025-05-07T20:32:17.9209744Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() at moe/activation_test.py:126 fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:17.9226952Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() at moe/activation_test.py:126 fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:17.9244739Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() at moe/activation_test.py:126 fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:17.9261973Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() at moe/activation_test.py:126 fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:17.9278865Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> fn() at moe/activation_test.py:117 fails compiling _fbgemm_silu_mul_quant (via torch.compile and fbgemm_gpu/experimental/gen_ai/moe/activation.py:80)
2025-05-07T20:32:17.9292258Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> ref_fn() at moe/activation_test.py:126 fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:17.9309329Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> fn() at moe/activation_test.py:117 fails compiling _fbgemm_silu_mul_quant (eager path, fbgemm_gpu/experimental/gen_ai/moe/activation.py:80)
2025-05-07T20:32:17.9328445Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> fn() at moe/activation_test.py:117 fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:17.9342685Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> fn() at moe/activation_test.py:117 fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:17.9355965Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fn() at moe/activation_test.py:117 fails in fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant); trace truncated

Every failure above shares the same root cause:
E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9363184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9363416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9363767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9363863Z kernel = self.compile( 2025-05-07T20:32:17.9364251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9364434Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9364570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9364574Z 2025-05-07T20:32:17.9364783Z self = 2025-05-07T20:32:17.9365557Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9366146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ead645e0>} 2025-05-07T20:32:17.9366890Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9367094Z context = 2025-05-07T20:32:17.9367098Z 2025-05-07T20:32:17.9367266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9367536Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9367653Z module_map=module_map) 2025-05-07T20:32:17.9367815Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9367921Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9367998Z E ^ 2025-05-07T20:32:17.9368359Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9368363Z 2025-05-07T20:32:17.9368788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9368833Z 2025-05-07T20:32:17.9368940Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9369173Z self=, 2025-05-07T20:32:17.9369253Z T=128, 2025-05-07T20:32:17.9369332Z D=5120, 2025-05-07T20:32:17.9369423Z scale_ub=1200.0, 2025-05-07T20:32:17.9369510Z contiguous=True, 2025-05-07T20:32:17.9369595Z compiled=False, 2025-05-07T20:32:17.9369677Z ) 2025-05-07T20:32:17.9369898Z self = 2025-05-07T20:32:17.9370070Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9370190Z 2025-05-07T20:32:17.9370283Z @given( 2025-05-07T20:32:17.9370403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9370516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9370635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9370754Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9370878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9370955Z ) 2025-05-07T20:32:17.9371203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9371304Z def test_silu_mul_quant( 2025-05-07T20:32:17.9371384Z self, 2025-05-07T20:32:17.9371462Z T: int, 2025-05-07T20:32:17.9371546Z D: int, 2025-05-07T20:32:17.9371645Z scale_ub: Optional[float], 2025-05-07T20:32:17.9371737Z contiguous: bool, 2025-05-07T20:32:17.9371830Z compiled: bool, 2025-05-07T20:32:17.9371911Z ) -> None: 2025-05-07T20:32:17.9372023Z torch.manual_seed(2025) 2025-05-07T20:32:17.9372103Z 2025-05-07T20:32:17.9372278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9372361Z 2025-05-07T20:32:17.9372456Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9372582Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9372684Z x = x_sign * x_clamp 2025-05-07T20:32:17.9372767Z x0 = x[:, :D] 2025-05-07T20:32:17.9372850Z x1 = x[:, D:] 2025-05-07T20:32:17.9372932Z 2025-05-07T20:32:17.9373019Z if contiguous: 2025-05-07T20:32:17.9373115Z x0 = x0.contiguous() 2025-05-07T20:32:17.9373213Z x1 = x1.contiguous() 2025-05-07T20:32:17.9373286Z 2025-05-07T20:32:17.9373380Z if scale_ub is not None: 2025-05-07T20:32:17.9373497Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9373634Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9373721Z ) 2025-05-07T20:32:17.9373888Z else: 2025-05-07T20:32:17.9373989Z scale_ub_tensor = None 2025-05-07T20:32:17.9374071Z 2025-05-07T20:32:17.9374205Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9374296Z op = silu_mul_quant 2025-05-07T20:32:17.9374390Z if compiled: 2025-05-07T20:32:17.9374494Z op = torch.compile(op) 2025-05-07T20:32:17.9374603Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9374685Z 2025-05-07T20:32:17.9374780Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9374784Z 2025-05-07T20:32:17.9374893Z moe/activation_test.py:117: 2025-05-07T20:32:17.9375021Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9375124Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9375231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9375741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9375845Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9376213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9376436Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9376855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9376951Z kernel = self.compile( 2025-05-07T20:32:17.9377331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9377513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9377639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9377644Z 2025-05-07T20:32:17.9377851Z self = 2025-05-07T20:32:17.9378688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9379201Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7eb2c3b80>} 2025-05-07T20:32:17.9379947Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9380140Z context = 2025-05-07T20:32:17.9380145Z 2025-05-07T20:32:17.9380317Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9380587Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9380698Z module_map=module_map) 2025-05-07T20:32:17.9380884Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9380984Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9381203Z E ^ 2025-05-07T20:32:17.9381573Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9381577Z 2025-05-07T20:32:17.9381991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9381996Z 2025-05-07T20:32:17.9382109Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9382334Z self=, 2025-05-07T20:32:17.9382412Z T=1, 2025-05-07T20:32:17.9382499Z D=7168, 2025-05-07T20:32:17.9382583Z scale_ub=1200.0, 2025-05-07T20:32:17.9382672Z contiguous=True, 2025-05-07T20:32:17.9382851Z compiled=True, 2025-05-07T20:32:17.9382930Z ) 2025-05-07T20:32:17.9383150Z self = 2025-05-07T20:32:17.9383324Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.9383332Z 2025-05-07T20:32:17.9383412Z @given( 2025-05-07T20:32:17.9383532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9383642Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9383757Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9383887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9384001Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9384077Z ) 2025-05-07T20:32:17.9384329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9384425Z def test_silu_mul_quant( 2025-05-07T20:32:17.9384504Z self, 2025-05-07T20:32:17.9384598Z T: int, 2025-05-07T20:32:17.9384679Z D: int, 2025-05-07T20:32:17.9384779Z scale_ub: Optional[float], 2025-05-07T20:32:17.9384876Z contiguous: bool, 2025-05-07T20:32:17.9384965Z compiled: bool, 2025-05-07T20:32:17.9385045Z ) -> None: 2025-05-07T20:32:17.9385150Z torch.manual_seed(2025) 2025-05-07T20:32:17.9385265Z 2025-05-07T20:32:17.9385444Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9385520Z 2025-05-07T20:32:17.9385614Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9385747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9385838Z x = x_sign * x_clamp 2025-05-07T20:32:17.9385922Z x0 = x[:, :D] 2025-05-07T20:32:17.9386009Z x1 = x[:, D:] 2025-05-07T20:32:17.9386082Z 2025-05-07T20:32:17.9386167Z if contiguous: 2025-05-07T20:32:17.9386266Z x0 = x0.contiguous() 2025-05-07T20:32:17.9386357Z x1 = x1.contiguous() 2025-05-07T20:32:17.9386481Z 2025-05-07T20:32:17.9386584Z if scale_ub is not None: 2025-05-07T20:32:17.9386696Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9386849Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9386927Z ) 2025-05-07T20:32:17.9387009Z else: 2025-05-07T20:32:17.9387113Z scale_ub_tensor = None 2025-05-07T20:32:17.9387189Z 2025-05-07T20:32:17.9387330Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9387431Z op = silu_mul_quant 2025-05-07T20:32:17.9387519Z if compiled: 2025-05-07T20:32:17.9387624Z op = torch.compile(op) 2025-05-07T20:32:17.9387741Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9387816Z 2025-05-07T20:32:17.9387912Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9387923Z 2025-05-07T20:32:17.9388026Z moe/activation_test.py:117: 2025-05-07T20:32:17.9388173Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9388292Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9388398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9388839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9388946Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9389550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9389654Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9390092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9390351Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9390764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9390944Z kernel = self.compile( 2025-05-07T20:32:17.9391332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9391515Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9391645Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9391650Z 2025-05-07T20:32:17.9391866Z self = 2025-05-07T20:32:17.9392640Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9393147Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea58e820>} 2025-05-07T20:32:17.9393902Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9394097Z context = 2025-05-07T20:32:17.9394179Z 2025-05-07T20:32:17.9394357Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9394620Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9394731Z module_map=module_map) 2025-05-07T20:32:17.9394901Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9395001Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9395087Z E ^ 2025-05-07T20:32:17.9395441Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9395484Z 2025-05-07T20:32:17.9395911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9395916Z 2025-05-07T20:32:17.9396030Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9396254Z self=, 2025-05-07T20:32:17.9396344Z T=1, 2025-05-07T20:32:17.9396423Z D=7168, 2025-05-07T20:32:17.9396507Z scale_ub=1200.0, 2025-05-07T20:32:17.9396604Z contiguous=False, 2025-05-07T20:32:17.9396689Z compiled=True, 2025-05-07T20:32:17.9396764Z ) 2025-05-07T20:32:17.9396991Z self = 2025-05-07T20:32:17.9397158Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9397163Z 2025-05-07T20:32:17.9397242Z @given( 2025-05-07T20:32:17.9397370Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9397480Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9397597Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9397723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9397839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9397926Z ) 2025-05-07T20:32:17.9398178Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9398274Z def test_silu_mul_quant( 2025-05-07T20:32:17.9398360Z self, 2025-05-07T20:32:17.9398442Z T: int, 2025-05-07T20:32:17.9398521Z D: int, 2025-05-07T20:32:17.9398631Z scale_ub: Optional[float], 2025-05-07T20:32:17.9398724Z contiguous: bool, 2025-05-07T20:32:17.9398815Z compiled: bool, 2025-05-07T20:32:17.9398904Z ) -> None: 2025-05-07T20:32:17.9399002Z torch.manual_seed(2025) 2025-05-07T20:32:17.9399077Z 2025-05-07T20:32:17.9399257Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9399415Z 2025-05-07T20:32:17.9399517Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9399644Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9399736Z x = x_sign * x_clamp 2025-05-07T20:32:17.9399831Z x0 = x[:, :D] 2025-05-07T20:32:17.9399913Z x1 = x[:, D:] 2025-05-07T20:32:17.9399993Z 2025-05-07T20:32:17.9400088Z if contiguous: 2025-05-07T20:32:17.9400179Z x0 = x0.contiguous() 2025-05-07T20:32:17.9400269Z x1 = x1.contiguous() 2025-05-07T20:32:17.9400351Z 2025-05-07T20:32:17.9400445Z if scale_ub is not None: 2025-05-07T20:32:17.9400555Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9400702Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9400784Z ) 2025-05-07T20:32:17.9400869Z else: 2025-05-07T20:32:17.9400965Z scale_ub_tensor = None 2025-05-07T20:32:17.9401039Z 2025-05-07T20:32:17.9401193Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9401286Z op = silu_mul_quant 2025-05-07T20:32:17.9401373Z if compiled: 2025-05-07T20:32:17.9401481Z op = torch.compile(op) 2025-05-07T20:32:17.9401589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9401709Z 2025-05-07T20:32:17.9401812Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9401816Z 2025-05-07T20:32:17.9401915Z moe/activation_test.py:117: 2025-05-07T20:32:17.9402044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9402153Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9402256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9402630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9402723Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9403222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9403372Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9403734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9403963Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9404306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9404401Z kernel = self.compile( 2025-05-07T20:32:17.9404788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9404969Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9405097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9405101Z 2025-05-07T20:32:17.9405321Z self = 2025-05-07T20:32:17.9406095Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9406616Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea4944c0>} 2025-05-07T20:32:17.9407360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9407561Z context = 2025-05-07T20:32:17.9407566Z 2025-05-07T20:32:17.9407732Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9408107Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9408228Z module_map=module_map) 2025-05-07T20:32:17.9408389Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9408490Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9408576Z E ^ 2025-05-07T20:32:17.9408928Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9408933Z 2025-05-07T20:32:17.9409356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9409360Z 2025-05-07T20:32:17.9409465Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9409689Z self=, 2025-05-07T20:32:17.9409778Z T=1, 2025-05-07T20:32:17.9409855Z D=7168, 2025-05-07T20:32:17.9409938Z scale_ub=None, 2025-05-07T20:32:17.9410040Z contiguous=False, 2025-05-07T20:32:17.9410126Z compiled=True, 2025-05-07T20:32:17.9410207Z ) 2025-05-07T20:32:17.9410424Z self = 2025-05-07T20:32:17.9410590Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.9410636Z 2025-05-07T20:32:17.9410726Z @given( 2025-05-07T20:32:17.9410847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9410949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9411074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9411193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9411306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9411388Z ) 2025-05-07T20:32:17.9411638Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9411743Z def test_silu_mul_quant( 2025-05-07T20:32:17.9411869Z self, 2025-05-07T20:32:17.9411951Z T: int, 2025-05-07T20:32:17.9412038Z D: int, 2025-05-07T20:32:17.9412140Z scale_ub: Optional[float], 2025-05-07T20:32:17.9412231Z contiguous: bool, 2025-05-07T20:32:17.9412326Z compiled: bool, 2025-05-07T20:32:17.9412407Z ) -> None: 2025-05-07T20:32:17.9412506Z torch.manual_seed(2025) 2025-05-07T20:32:17.9412587Z 2025-05-07T20:32:17.9412761Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9412838Z 2025-05-07T20:32:17.9412938Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9413066Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9413165Z x = x_sign * x_clamp 2025-05-07T20:32:17.9413248Z x0 = x[:, :D] 2025-05-07T20:32:17.9413333Z x1 = x[:, D:] 2025-05-07T20:32:17.9413426Z 2025-05-07T20:32:17.9413524Z if contiguous: 2025-05-07T20:32:17.9413637Z x0 = x0.contiguous() 2025-05-07T20:32:17.9413748Z x1 = x1.contiguous() 2025-05-07T20:32:17.9413820Z 2025-05-07T20:32:17.9413913Z if scale_ub is not None: 2025-05-07T20:32:17.9414028Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9414165Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9414249Z ) 2025-05-07T20:32:17.9414336Z else: 2025-05-07T20:32:17.9414434Z scale_ub_tensor = None 2025-05-07T20:32:17.9414520Z 2025-05-07T20:32:17.9414652Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9414745Z op = silu_mul_quant 2025-05-07T20:32:17.9414838Z if compiled: 2025-05-07T20:32:17.9414939Z op = torch.compile(op) 2025-05-07T20:32:17.9415046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9415129Z 2025-05-07T20:32:17.9415222Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.9415346Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.9415508Z 2025-05-07T20:32:17.9415649Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9415753Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.9415862Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.9415985Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.9416137Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.9416218Z 2025-05-07T20:32:17.9416321Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:17.9416325Z 2025-05-07T20:32:17.9416433Z moe/activation_test.py:126: 2025-05-07T20:32:17.9416561Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9416668Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.9416809Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.9417379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.9417488Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.9417848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9418072Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9418495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.9418751Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.9419147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.9419407Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.9419790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.9420017Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.9420358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.9420437Z fn() 2025-05-07T20:32:17.9420844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.9420928Z self.fn.run( 2025-05-07T20:32:17.9421393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9421490Z kernel = self.compile( 2025-05-07T20:32:17.9421867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9422051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9422184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9422191Z 2025-05-07T20:32:17.9422403Z self = 2025-05-07T20:32:17.9423183Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9423747Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd7ea51e040>} 2025-05-07T20:32:17.9424493Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9424686Z context = 2025-05-07T20:32:17.9424691Z 2025-05-07T20:32:17.9424945Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9425218Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9425327Z module_map=module_map) 2025-05-07T20:32:17.9425495Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9425605Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.9425684Z E ^ 2025-05-07T20:32:17.9426051Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9426056Z 2025-05-07T20:32:17.9426468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9426472Z 2025-05-07T20:32:17.9426583Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9426807Z self=, 2025-05-07T20:32:17.9426898Z T=1, 2025-05-07T20:32:17.9426992Z D=5120, 2025-05-07T20:32:17.9427078Z scale_ub=1200.0, 2025-05-07T20:32:17.9427168Z contiguous=False, 2025-05-07T20:32:17.9427259Z compiled=True, 2025-05-07T20:32:17.9427334Z ) 2025-05-07T20:32:17.9427562Z self = 2025-05-07T20:32:17.9427770Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9427775Z 2025-05-07T20:32:17.9427854Z @given( 2025-05-07T20:32:17.9427984Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9428085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9428201Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9428330Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9428445Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9428520Z ) 2025-05-07T20:32:17.9428781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9428928Z def test_silu_mul_quant( 2025-05-07T20:32:17.9429015Z self, 2025-05-07T20:32:17.9429094Z T: int, 2025-05-07T20:32:17.9429173Z D: int, 2025-05-07T20:32:17.9429282Z scale_ub: Optional[float], 2025-05-07T20:32:17.9429375Z contiguous: bool, 2025-05-07T20:32:17.9429466Z compiled: bool, 2025-05-07T20:32:17.9429555Z ) -> None: 2025-05-07T20:32:17.9429650Z torch.manual_seed(2025) 2025-05-07T20:32:17.9429730Z 2025-05-07T20:32:17.9429915Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9429993Z 2025-05-07T20:32:17.9430088Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9430223Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9430314Z x = x_sign * x_clamp 2025-05-07T20:32:17.9430404Z x0 = x[:, :D] 2025-05-07T20:32:17.9430487Z x1 = x[:, D:] 2025-05-07T20:32:17.9430561Z 2025-05-07T20:32:17.9430665Z if contiguous: 2025-05-07T20:32:17.9430764Z x0 = x0.contiguous() 2025-05-07T20:32:17.9430855Z x1 = x1.contiguous() 2025-05-07T20:32:17.9430936Z 2025-05-07T20:32:17.9431028Z if scale_ub is not None: 2025-05-07T20:32:17.9431137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9431286Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9431370Z ) 2025-05-07T20:32:17.9431451Z else: 2025-05-07T20:32:17.9431555Z scale_ub_tensor = None 2025-05-07T20:32:17.9431629Z 2025-05-07T20:32:17.9431764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9431867Z op = silu_mul_quant 2025-05-07T20:32:17.9431956Z if compiled: 
2025-05-07T20:32:17.9432064Z op = torch.compile(op) 2025-05-07T20:32:17.9432175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9432249Z 2025-05-07T20:32:17.9432433Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9432439Z 2025-05-07T20:32:17.9432539Z moe/activation_test.py:117: 2025-05-07T20:32:17.9432669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9432775Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9432878Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9433257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9433351Z return fn(*args, **kwargs) 2025-05-07T20:32:17.9433846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9433949Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9434304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9434529Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9434883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9434979Z kernel = self.compile( 2025-05-07T20:32:17.9435365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9435585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9435713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9435718Z 2025-05-07T20:32:17.9435936Z self = 2025-05-07T20:32:17.9436707Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9437220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea51ef70>} 2025-05-07T20:32:17.9438016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9438210Z context = 2025-05-07T20:32:17.9438221Z 2025-05-07T20:32:17.9438387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9438650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9438767Z module_map=module_map) 2025-05-07T20:32:17.9438930Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9439033Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9439117Z E ^ 2025-05-07T20:32:17.9439486Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9439491Z 2025-05-07T20:32:17.9439910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9439916Z 2025-05-07T20:32:17.9440020Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9440662Z self=, 2025-05-07T20:32:17.9440801Z T=1, 2025-05-07T20:32:17.9440920Z D=5120, 2025-05-07T20:32:17.9441055Z scale_ub=1200.0, 2025-05-07T20:32:17.9441191Z contiguous=False, 2025-05-07T20:32:17.9441310Z compiled=False, 2025-05-07T20:32:17.9441412Z ) 2025-05-07T20:32:17.9441657Z self = 2025-05-07T20:32:17.9441830Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.9441836Z 2025-05-07T20:32:17.9442206Z @given( 2025-05-07T20:32:17.9442330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9442431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9442553Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9442671Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9442789Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9442871Z ) 2025-05-07T20:32:17.9443119Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9443214Z def test_silu_mul_quant( 2025-05-07T20:32:17.9443301Z self, 2025-05-07T20:32:17.9443380Z T: int, 2025-05-07T20:32:17.9443464Z D: int, 2025-05-07T20:32:17.9443578Z scale_ub: Optional[float], 2025-05-07T20:32:17.9455854Z contiguous: bool, 2025-05-07T20:32:17.9456029Z compiled: bool, 2025-05-07T20:32:17.9456149Z ) -> None: 2025-05-07T20:32:17.9456338Z torch.manual_seed(2025) 2025-05-07T20:32:17.9456448Z 2025-05-07T20:32:17.9456691Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9456796Z 2025-05-07T20:32:17.9456921Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9457095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9457425Z x = x_sign * x_clamp 2025-05-07T20:32:17.9457543Z x0 = x[:, :D] 2025-05-07T20:32:17.9457640Z x1 = x[:, D:] 2025-05-07T20:32:17.9457719Z 2025-05-07T20:32:17.9457810Z if contiguous: 2025-05-07T20:32:17.9457919Z x0 = x0.contiguous() 2025-05-07T20:32:17.9458014Z x1 = x1.contiguous() 2025-05-07T20:32:17.9458090Z 2025-05-07T20:32:17.9458195Z if scale_ub is not None: 2025-05-07T20:32:17.9458311Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9458457Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9458549Z ) 2025-05-07T20:32:17.9458724Z else: 2025-05-07T20:32:17.9458834Z scale_ub_tensor = None 2025-05-07T20:32:17.9458914Z 2025-05-07T20:32:17.9459050Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9459156Z op = silu_mul_quant 2025-05-07T20:32:17.9459247Z if compiled: 2025-05-07T20:32:17.9459357Z op = torch.compile(op) 2025-05-07T20:32:17.9459477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9459554Z 2025-05-07T20:32:17.9459652Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9459657Z 2025-05-07T20:32:17.9459773Z moe/activation_test.py:117: 2025-05-07T20:32:17.9459917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9460039Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9460148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9460664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9460778Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9461290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9461539Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9461962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9462088Z kernel = self.compile( 2025-05-07T20:32:17.9462490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9462678Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9462812Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9462817Z 2025-05-07T20:32:17.9463038Z self = 2025-05-07T20:32:17.9463917Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9464451Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9e373a0>} 2025-05-07T20:32:17.9465213Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9465412Z context = 2025-05-07T20:32:17.9465425Z 2025-05-07T20:32:17.9465603Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9465882Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9466005Z module_map=module_map) 2025-05-07T20:32:17.9466177Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9466281Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9466375Z E ^ 2025-05-07T20:32:17.9466776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9466781Z 2025-05-07T20:32:17.9467207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9467211Z 2025-05-07T20:32:17.9467321Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9467549Z self=, 2025-05-07T20:32:17.9467641Z T=16384, 2025-05-07T20:32:17.9467724Z D=5120, 2025-05-07T20:32:17.9467812Z scale_ub=1200.0, 2025-05-07T20:32:17.9467963Z contiguous=False, 2025-05-07T20:32:17.9468052Z compiled=True, 2025-05-07T20:32:17.9468132Z ) 2025-05-07T20:32:17.9468364Z self = 2025-05-07T20:32:17.9468546Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9468553Z 2025-05-07T20:32:17.9468643Z @given( 2025-05-07T20:32:17.9468770Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9468877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9469006Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9469127Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9469245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9469331Z ) 2025-05-07T20:32:17.9469584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9469689Z def test_silu_mul_quant( 2025-05-07T20:32:17.9469772Z self, 2025-05-07T20:32:17.9469862Z T: int, 2025-05-07T20:32:17.9469953Z D: int, 2025-05-07T20:32:17.9470056Z scale_ub: Optional[float], 2025-05-07T20:32:17.9470150Z contiguous: bool, 2025-05-07T20:32:17.9470248Z compiled: bool, 2025-05-07T20:32:17.9470330Z ) -> None: 2025-05-07T20:32:17.9470432Z torch.manual_seed(2025) 2025-05-07T20:32:17.9470515Z 2025-05-07T20:32:17.9470688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9470765Z 2025-05-07T20:32:17.9470869Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9471002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9471096Z x = x_sign * x_clamp 2025-05-07T20:32:17.9471187Z x0 = x[:, :D] 2025-05-07T20:32:17.9471271Z x1 = x[:, D:] 2025-05-07T20:32:17.9471354Z 2025-05-07T20:32:17.9471443Z if contiguous: 2025-05-07T20:32:17.9471539Z x0 = x0.contiguous() 2025-05-07T20:32:17.9471718Z x1 = x1.contiguous() 2025-05-07T20:32:17.9471801Z 2025-05-07T20:32:17.9471899Z if scale_ub is not None: 2025-05-07T20:32:17.9472018Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9472161Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9472242Z ) 2025-05-07T20:32:17.9493363Z else: 2025-05-07T20:32:17.9493577Z scale_ub_tensor = None 2025-05-07T20:32:17.9493654Z 2025-05-07T20:32:17.9493807Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9493903Z op = silu_mul_quant 2025-05-07T20:32:17.9494005Z if compiled: 2025-05-07T20:32:17.9494107Z op = torch.compile(op) 2025-05-07T20:32:17.9494215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9494292Z 2025-05-07T20:32:17.9494383Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9494389Z 2025-05-07T20:32:17.9494494Z moe/activation_test.py:117: 2025-05-07T20:32:17.9494662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9494766Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9494871Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9495258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9495522Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9496022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9496120Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9496476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9496706Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9497047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9497206Z kernel = self.compile( 2025-05-07T20:32:17.9497599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9497776Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9497910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9497915Z 2025-05-07T20:32:17.9498124Z self = 2025-05-07T20:32:17.9498898Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9499416Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea45b0d0>} 2025-05-07T20:32:17.9500178Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9500377Z context = 2025-05-07T20:32:17.9500384Z 2025-05-07T20:32:17.9500554Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9500833Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9500945Z module_map=module_map) 2025-05-07T20:32:17.9501185Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9501292Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9501369Z E ^ 2025-05-07T20:32:17.9501730Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9501841Z 2025-05-07T20:32:17.9502270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9502275Z 2025-05-07T20:32:17.9502379Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9502606Z self=, 2025-05-07T20:32:17.9502690Z T=2048, 2025-05-07T20:32:17.9502770Z D=7168, 2025-05-07T20:32:17.9502879Z scale_ub=1200.0, 2025-05-07T20:32:17.9502974Z contiguous=False, 2025-05-07T20:32:17.9503073Z compiled=True, 2025-05-07T20:32:17.9503157Z ) 2025-05-07T20:32:17.9503380Z self = 2025-05-07T20:32:17.9503556Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9503566Z 2025-05-07T20:32:17.9503641Z @given( 2025-05-07T20:32:17.9503764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9503884Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9504000Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9504118Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9504239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9504314Z ) 2025-05-07T20:32:17.9504611Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9504712Z def test_silu_mul_quant( 2025-05-07T20:32:17.9504786Z self, 2025-05-07T20:32:17.9504862Z T: int, 2025-05-07T20:32:17.9504941Z D: int, 2025-05-07T20:32:17.9505039Z scale_ub: Optional[float], 2025-05-07T20:32:17.9505133Z contiguous: bool, 2025-05-07T20:32:17.9505218Z compiled: bool, 2025-05-07T20:32:17.9505297Z ) -> None: 2025-05-07T20:32:17.9505397Z torch.manual_seed(2025) 2025-05-07T20:32:17.9505472Z 2025-05-07T20:32:17.9505652Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9505825Z 2025-05-07T20:32:17.9505918Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9506045Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9506137Z x = x_sign * x_clamp 2025-05-07T20:32:17.9506219Z x0 = x[:, :D] 2025-05-07T20:32:17.9506301Z x1 = x[:, D:] 2025-05-07T20:32:17.9506384Z 2025-05-07T20:32:17.9506468Z if contiguous: 2025-05-07T20:32:17.9506566Z x0 = x0.contiguous() 2025-05-07T20:32:17.9506659Z x1 = x1.contiguous() 2025-05-07T20:32:17.9506735Z 2025-05-07T20:32:17.9506839Z if scale_ub is not None: 2025-05-07T20:32:17.9506948Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9507087Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9507178Z ) 2025-05-07T20:32:17.9507257Z else: 2025-05-07T20:32:17.9507354Z scale_ub_tensor = None 2025-05-07T20:32:17.9507437Z 2025-05-07T20:32:17.9507578Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9507670Z op = silu_mul_quant 2025-05-07T20:32:17.9507765Z if compiled: 2025-05-07T20:32:17.9507866Z op = torch.compile(op) 2025-05-07T20:32:17.9507976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9508052Z 2025-05-07T20:32:17.9508145Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9508150Z 2025-05-07T20:32:17.9508255Z moe/activation_test.py:117: 2025-05-07T20:32:17.9508387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9508489Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9508599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9508974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9509066Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9509658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9509760Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9510132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9510363Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9510700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9510803Z kernel = self.compile( 2025-05-07T20:32:17.9511183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9511366Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9511494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9511499Z 2025-05-07T20:32:17.9511713Z self = 2025-05-07T20:32:17.9512499Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9513053Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7ea45bca0>} 2025-05-07T20:32:17.9513806Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9513999Z context = 2025-05-07T20:32:17.9514004Z 2025-05-07T20:32:17.9514181Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9514501Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9514611Z module_map=module_map) 2025-05-07T20:32:17.9514784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9514886Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9514970Z E ^ 2025-05-07T20:32:17.9515340Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9515345Z 2025-05-07T20:32:17.9515757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9515762Z 2025-05-07T20:32:17.9515875Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9516099Z self=, 2025-05-07T20:32:17.9516180Z T=1, 2025-05-07T20:32:17.9516268Z D=5120, 2025-05-07T20:32:17.9516357Z scale_ub=None, 2025-05-07T20:32:17.9516448Z contiguous=False, 2025-05-07T20:32:17.9516546Z compiled=False, 2025-05-07T20:32:17.9516622Z ) 2025-05-07T20:32:17.9516845Z self = 2025-05-07T20:32:17.9517023Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.9517030Z 2025-05-07T20:32:17.9517110Z @given( 2025-05-07T20:32:17.9517240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9517342Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9517458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9517586Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9517703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9517780Z ) 2025-05-07T20:32:17.9518035Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9518130Z def test_silu_mul_quant( 2025-05-07T20:32:17.9518302Z self, 2025-05-07T20:32:17.9518382Z T: int, 2025-05-07T20:32:17.9518461Z D: int, 2025-05-07T20:32:17.9518568Z scale_ub: Optional[float], 2025-05-07T20:32:17.9518659Z contiguous: bool, 2025-05-07T20:32:17.9518749Z compiled: bool, 2025-05-07T20:32:17.9518840Z ) -> None: 2025-05-07T20:32:17.9518934Z torch.manual_seed(2025) 2025-05-07T20:32:17.9519009Z 2025-05-07T20:32:17.9519188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9519264Z 2025-05-07T20:32:17.9519357Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9519491Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9519581Z x = x_sign * x_clamp 2025-05-07T20:32:17.9519663Z x0 = x[:, :D] 2025-05-07T20:32:17.9519753Z x1 = x[:, D:] 2025-05-07T20:32:17.9519826Z 2025-05-07T20:32:17.9519917Z if contiguous: 2025-05-07T20:32:17.9520011Z x0 = x0.contiguous() 2025-05-07T20:32:17.9520113Z x1 = x1.contiguous() 2025-05-07T20:32:17.9520193Z 2025-05-07T20:32:17.9520285Z if scale_ub is not None: 2025-05-07T20:32:17.9520393Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9520542Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9520665Z ) 2025-05-07T20:32:17.9520741Z else: 2025-05-07T20:32:17.9520849Z scale_ub_tensor = None 2025-05-07T20:32:17.9520924Z 2025-05-07T20:32:17.9521057Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9521159Z op = silu_mul_quant 2025-05-07T20:32:17.9521246Z if compiled: 2025-05-07T20:32:17.9521355Z op = torch.compile(op) 2025-05-07T20:32:17.9521464Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9521538Z 2025-05-07T20:32:17.9521640Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9521645Z 2025-05-07T20:32:17.9521748Z moe/activation_test.py:117: 2025-05-07T20:32:17.9521924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9522035Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9522136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9522635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9522745Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.9523112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:17.9523347Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:17.9523690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:17.9523788Z     kernel = self.compile(
2025-05-07T20:32:17.9524191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.9524373Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.9524515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.9527246Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:17.9527512Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:17.9527627Z                            module_map=module_map)
2025-05-07T20:32:17.9527789Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.9527897Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.9527975Z E       ^
2025-05-07T20:32:17.9528335Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.9528761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tried eleven more parameter combinations, and every one failed at the same kernel-launch site with the identical CompilationError; the test body and traceback repeat verbatim apart from the parameters (compiled=True runs additionally pass through torch/_dynamo/eval_frame.py:678 in _fn before reaching activation.py:80):

Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9688185Z 2025-05-07T20:32:17.9688610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9688614Z 2025-05-07T20:32:17.9688716Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9689026Z self=, 2025-05-07T20:32:17.9689111Z T=16384, 2025-05-07T20:32:17.9689193Z D=5120, 2025-05-07T20:32:17.9689283Z scale_ub=1200.0, 2025-05-07T20:32:17.9689374Z contiguous=False, 2025-05-07T20:32:17.9689460Z compiled=False, 2025-05-07T20:32:17.9689540Z ) 2025-05-07T20:32:17.9689760Z self = 2025-05-07T20:32:17.9689940Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.9689950Z 2025-05-07T20:32:17.9690027Z @given( 2025-05-07T20:32:17.9690147Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9690254Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9690369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9690488Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9690609Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9690685Z ) 2025-05-07T20:32:17.9690941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9691043Z def test_silu_mul_quant( 2025-05-07T20:32:17.9691120Z self, 2025-05-07T20:32:17.9691206Z T: int, 2025-05-07T20:32:17.9691284Z D: int, 2025-05-07T20:32:17.9691427Z scale_ub: Optional[float], 2025-05-07T20:32:17.9691524Z contiguous: bool, 2025-05-07T20:32:17.9691610Z compiled: bool, 2025-05-07T20:32:17.9691688Z ) -> None: 2025-05-07T20:32:17.9691791Z torch.manual_seed(2025) 2025-05-07T20:32:17.9691864Z 2025-05-07T20:32:17.9692033Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9692116Z 2025-05-07T20:32:17.9692208Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9692332Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9692434Z x = x_sign * x_clamp 2025-05-07T20:32:17.9692515Z x0 = x[:, :D] 2025-05-07T20:32:17.9692647Z x1 = x[:, D:] 2025-05-07T20:32:17.9692727Z 2025-05-07T20:32:17.9692811Z if contiguous: 2025-05-07T20:32:17.9692915Z x0 = x0.contiguous() 2025-05-07T20:32:17.9693027Z x1 = x1.contiguous() 2025-05-07T20:32:17.9693103Z 2025-05-07T20:32:17.9693221Z if scale_ub is not None: 2025-05-07T20:32:17.9693330Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9693466Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9693549Z ) 2025-05-07T20:32:17.9693628Z else: 2025-05-07T20:32:17.9693723Z scale_ub_tensor = None 2025-05-07T20:32:17.9693807Z 2025-05-07T20:32:17.9693940Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9694034Z op = silu_mul_quant 2025-05-07T20:32:17.9694126Z if compiled: 2025-05-07T20:32:17.9694225Z op = torch.compile(op) 2025-05-07T20:32:17.9694336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9694417Z 2025-05-07T20:32:17.9694509Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9694513Z 2025-05-07T20:32:17.9694617Z moe/activation_test.py:117: 2025-05-07T20:32:17.9694745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9694850Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9694957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9695452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:17.9695548Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9695912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9696137Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9696560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9696659Z kernel = self.compile( 2025-05-07T20:32:17.9697039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9697220Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9697348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9697352Z 2025-05-07T20:32:17.9697565Z self = 2025-05-07T20:32:17.9698333Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9698835Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9b488b0>} 2025-05-07T20:32:17.9699602Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9699796Z context = 2025-05-07T20:32:17.9699840Z 2025-05-07T20:32:17.9700015Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9700278Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9700387Z module_map=module_map) 2025-05-07T20:32:17.9700556Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9700654Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9700741Z E ^ 2025-05-07T20:32:17.9701147Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9701204Z 2025-05-07T20:32:17.9701626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9701631Z 2025-05-07T20:32:17.9701739Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9701964Z self=, 2025-05-07T20:32:17.9702050Z T=16384, 2025-05-07T20:32:17.9702126Z D=5120, 2025-05-07T20:32:17.9702209Z scale_ub=1200.0, 2025-05-07T20:32:17.9702300Z contiguous=True, 2025-05-07T20:32:17.9702383Z compiled=True, 2025-05-07T20:32:17.9702456Z ) 2025-05-07T20:32:17.9702678Z self = 2025-05-07T20:32:17.9702856Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.9702861Z 2025-05-07T20:32:17.9702954Z @given( 2025-05-07T20:32:17.9703093Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9703219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9703343Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9703460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9703573Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9703658Z ) 2025-05-07T20:32:17.9703904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9703998Z def test_silu_mul_quant( 2025-05-07T20:32:17.9704082Z self, 2025-05-07T20:32:17.9704158Z T: int, 2025-05-07T20:32:17.9704234Z D: int, 2025-05-07T20:32:17.9704338Z scale_ub: Optional[float], 2025-05-07T20:32:17.9704426Z contiguous: bool, 2025-05-07T20:32:17.9704516Z compiled: bool, 2025-05-07T20:32:17.9704601Z ) -> None: 2025-05-07T20:32:17.9704695Z torch.manual_seed(2025) 2025-05-07T20:32:17.9704768Z 2025-05-07T20:32:17.9705046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9705125Z 2025-05-07T20:32:17.9705224Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9705348Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9705437Z x = x_sign * x_clamp 2025-05-07T20:32:17.9705528Z x0 = x[:, :D] 2025-05-07T20:32:17.9705610Z x1 = x[:, D:] 2025-05-07T20:32:17.9705684Z 2025-05-07T20:32:17.9705777Z if contiguous: 2025-05-07T20:32:17.9705872Z x0 = x0.contiguous() 2025-05-07T20:32:17.9705962Z x1 = x1.contiguous() 2025-05-07T20:32:17.9706042Z 2025-05-07T20:32:17.9706133Z if scale_ub is not None: 2025-05-07T20:32:17.9706243Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9706383Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9706461Z ) 2025-05-07T20:32:17.9706552Z else: 2025-05-07T20:32:17.9706648Z scale_ub_tensor = None 2025-05-07T20:32:17.9706725Z 2025-05-07T20:32:17.9706870Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9706961Z op = silu_mul_quant 2025-05-07T20:32:17.9707047Z if compiled: 2025-05-07T20:32:17.9707154Z op = torch.compile(op) 2025-05-07T20:32:17.9707261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9707382Z 2025-05-07T20:32:17.9707479Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9707483Z 2025-05-07T20:32:17.9707583Z moe/activation_test.py:117: 2025-05-07T20:32:17.9707718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9707822Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9707922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9708293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9708387Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9708883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9709029Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9709395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9709628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9709966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9710059Z kernel = self.compile( 2025-05-07T20:32:17.9710445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9710620Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9710746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9710758Z 2025-05-07T20:32:17.9710969Z self = 2025-05-07T20:32:17.9711739Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9712261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9a215e0>} 2025-05-07T20:32:17.9713051Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9713247Z context = 2025-05-07T20:32:17.9713252Z 2025-05-07T20:32:17.9713416Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9713757Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9713877Z module_map=module_map) 2025-05-07T20:32:17.9714039Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9714138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9714225Z E ^ 2025-05-07T20:32:17.9714575Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9714581Z 2025-05-07T20:32:17.9715003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9715008Z 2025-05-07T20:32:17.9715115Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9715336Z self=, 2025-05-07T20:32:17.9715419Z T=16384, 2025-05-07T20:32:17.9715497Z D=5120, 2025-05-07T20:32:17.9715594Z scale_ub=None, 2025-05-07T20:32:17.9715682Z contiguous=False, 2025-05-07T20:32:17.9715765Z compiled=True, 2025-05-07T20:32:17.9715845Z ) 2025-05-07T20:32:17.9716062Z self = 2025-05-07T20:32:17.9716239Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.9716285Z 2025-05-07T20:32:17.9716372Z @given( 2025-05-07T20:32:17.9716490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9716590Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9716712Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9716832Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9716954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9717029Z ) 2025-05-07T20:32:17.9717274Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9717425Z def test_silu_mul_quant( 2025-05-07T20:32:17.9717502Z self, 2025-05-07T20:32:17.9717581Z T: int, 2025-05-07T20:32:17.9717666Z D: int, 2025-05-07T20:32:17.9717764Z scale_ub: Optional[float], 2025-05-07T20:32:17.9717854Z contiguous: bool, 2025-05-07T20:32:17.9717947Z compiled: bool, 2025-05-07T20:32:17.9718027Z ) -> None: 2025-05-07T20:32:17.9718122Z torch.manual_seed(2025) 2025-05-07T20:32:17.9718203Z 2025-05-07T20:32:17.9718371Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9718445Z 2025-05-07T20:32:17.9718544Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9718671Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9718769Z x = x_sign * x_clamp 2025-05-07T20:32:17.9718850Z x0 = x[:, :D] 2025-05-07T20:32:17.9718930Z x1 = x[:, D:] 2025-05-07T20:32:17.9719009Z 2025-05-07T20:32:17.9719095Z if contiguous: 2025-05-07T20:32:17.9719193Z x0 = x0.contiguous() 2025-05-07T20:32:17.9719292Z x1 = x1.contiguous() 2025-05-07T20:32:17.9719364Z 2025-05-07T20:32:17.9719456Z if scale_ub is not None: 2025-05-07T20:32:17.9719570Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9719705Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9719786Z ) 2025-05-07T20:32:17.9719872Z else: 2025-05-07T20:32:17.9719967Z scale_ub_tensor = None 2025-05-07T20:32:17.9720049Z 2025-05-07T20:32:17.9720180Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9720271Z op = silu_mul_quant 2025-05-07T20:32:17.9720367Z if compiled: 2025-05-07T20:32:17.9720469Z op = torch.compile(op) 2025-05-07T20:32:17.9720575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9720654Z 2025-05-07T20:32:17.9720747Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9720751Z 2025-05-07T20:32:17.9720933Z moe/activation_test.py:117: 2025-05-07T20:32:17.9721068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9721170Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9721277Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9721641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9721738Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9722239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9722337Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9722702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9722958Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9723326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9723433Z kernel = self.compile( 2025-05-07T20:32:17.9723811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9723989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9724163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9724167Z 2025-05-07T20:32:17.9724373Z self = 2025-05-07T20:32:17.9725153Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9725662Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9c405e0>} 2025-05-07T20:32:17.9726463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9726669Z context = 2025-05-07T20:32:17.9726673Z 2025-05-07T20:32:17.9726839Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9727114Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9727225Z module_map=module_map) 2025-05-07T20:32:17.9727386Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9727490Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9727569Z E ^ 2025-05-07T20:32:17.9727932Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9727944Z 2025-05-07T20:32:17.9728364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9728368Z 2025-05-07T20:32:17.9728471Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9728702Z self=, 2025-05-07T20:32:17.9728779Z T=2048, 2025-05-07T20:32:17.9728856Z D=5120, 2025-05-07T20:32:17.9728944Z scale_ub=None, 2025-05-07T20:32:17.9729032Z contiguous=False, 2025-05-07T20:32:17.9729115Z compiled=True, 2025-05-07T20:32:17.9729194Z ) 2025-05-07T20:32:17.9729410Z self = 2025-05-07T20:32:17.9729592Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.9729597Z 2025-05-07T20:32:17.9729674Z @given( 2025-05-07T20:32:17.9729874Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9729984Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9730102Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9730217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9730337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9730414Z ) 2025-05-07T20:32:17.9730665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9730761Z def test_silu_mul_quant( 2025-05-07T20:32:17.9730837Z self, 2025-05-07T20:32:17.9730921Z T: int, 2025-05-07T20:32:17.9730998Z D: int, 2025-05-07T20:32:17.9731096Z scale_ub: Optional[float], 2025-05-07T20:32:17.9731192Z contiguous: bool, 2025-05-07T20:32:17.9731279Z compiled: bool, 2025-05-07T20:32:17.9731358Z ) -> None: 2025-05-07T20:32:17.9731458Z torch.manual_seed(2025) 2025-05-07T20:32:17.9731531Z 2025-05-07T20:32:17.9731711Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9731793Z 2025-05-07T20:32:17.9731885Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9732010Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9732106Z x = x_sign * x_clamp 2025-05-07T20:32:17.9732234Z x0 = x[:, :D] 2025-05-07T20:32:17.9732321Z x1 = x[:, D:] 2025-05-07T20:32:17.9732394Z 2025-05-07T20:32:17.9732478Z if contiguous: 2025-05-07T20:32:17.9732577Z x0 = x0.contiguous() 2025-05-07T20:32:17.9732666Z x1 = x1.contiguous() 2025-05-07T20:32:17.9732740Z 2025-05-07T20:32:17.9732838Z if scale_ub is not None: 2025-05-07T20:32:17.9732954Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9733110Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9733213Z ) 2025-05-07T20:32:17.9733291Z else: 2025-05-07T20:32:17.9733390Z scale_ub_tensor = None 2025-05-07T20:32:17.9733512Z 2025-05-07T20:32:17.9738903Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9739020Z op = silu_mul_quant 2025-05-07T20:32:17.9739114Z if compiled: 2025-05-07T20:32:17.9739237Z op = torch.compile(op) 2025-05-07T20:32:17.9739355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9739434Z 2025-05-07T20:32:17.9739537Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9739543Z 2025-05-07T20:32:17.9739645Z moe/activation_test.py:117: 2025-05-07T20:32:17.9739791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9739899Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9740007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9740753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9740854Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9741436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9741551Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9741919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9742155Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9742496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9742591Z kernel = self.compile( 2025-05-07T20:32:17.9742985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9743169Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9743302Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9743523Z 2025-05-07T20:32:17.9743740Z self = 2025-05-07T20:32:17.9744518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9745043Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9a21c10>} 2025-05-07T20:32:17.9745798Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9746000Z context = 2025-05-07T20:32:17.9746005Z 2025-05-07T20:32:17.9746178Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9746454Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9746573Z module_map=module_map) 2025-05-07T20:32:17.9746738Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9746907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9746987Z E ^ 2025-05-07T20:32:17.9747345Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9747350Z 2025-05-07T20:32:17.9747772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9747776Z 2025-05-07T20:32:17.9747885Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9748108Z self=, 2025-05-07T20:32:17.9748256Z T=2048, 2025-05-07T20:32:17.9748341Z D=5120, 2025-05-07T20:32:17.9748437Z scale_ub=1200.0, 2025-05-07T20:32:17.9748526Z contiguous=False, 2025-05-07T20:32:17.9748612Z compiled=True, 2025-05-07T20:32:17.9748693Z ) 2025-05-07T20:32:17.9748915Z self = 2025-05-07T20:32:17.9749096Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9749101Z 2025-05-07T20:32:17.9749187Z @given( 2025-05-07T20:32:17.9749308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9749408Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9749533Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9749651Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9749775Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9749851Z ) 2025-05-07T20:32:17.9750106Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9750212Z def test_silu_mul_quant( 2025-05-07T20:32:17.9750290Z self, 2025-05-07T20:32:17.9750368Z T: int, 2025-05-07T20:32:17.9750455Z D: int, 2025-05-07T20:32:17.9750556Z scale_ub: Optional[float], 2025-05-07T20:32:17.9750647Z contiguous: bool, 2025-05-07T20:32:17.9750745Z compiled: bool, 2025-05-07T20:32:17.9750831Z ) -> None: 2025-05-07T20:32:17.9750932Z torch.manual_seed(2025) 2025-05-07T20:32:17.9751013Z 2025-05-07T20:32:17.9751186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9751269Z 2025-05-07T20:32:17.9751367Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9751496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9751595Z x = x_sign * x_clamp 2025-05-07T20:32:17.9751681Z x0 = x[:, :D] 2025-05-07T20:32:17.9751763Z x1 = x[:, D:] 2025-05-07T20:32:17.9751845Z 2025-05-07T20:32:17.9752021Z if contiguous: 2025-05-07T20:32:17.9752117Z x0 = x0.contiguous() 2025-05-07T20:32:17.9752218Z x1 = x1.contiguous() 2025-05-07T20:32:17.9752292Z 2025-05-07T20:32:17.9752385Z if scale_ub is not None: 2025-05-07T20:32:17.9752501Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9752643Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9752729Z ) 2025-05-07T20:32:17.9752810Z else: 2025-05-07T20:32:17.9752907Z scale_ub_tensor = None 2025-05-07T20:32:17.9752991Z 2025-05-07T20:32:17.9753125Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9753216Z op = silu_mul_quant 2025-05-07T20:32:17.9753311Z if compiled: 2025-05-07T20:32:17.9753413Z op = torch.compile(op) 2025-05-07T20:32:17.9753520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9753600Z 2025-05-07T20:32:17.9753691Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9753705Z 2025-05-07T20:32:17.9753806Z moe/activation_test.py:117: 2025-05-07T20:32:17.9753946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9754049Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9754164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9754580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9754676Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9755180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9755282Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9755640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9755876Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9756270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9756375Z kernel = self.compile( 2025-05-07T20:32:17.9756758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9756941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9757077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9757081Z 2025-05-07T20:32:17.9757293Z self = 2025-05-07T20:32:17.9758074Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9758580Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e98e8820>} 2025-05-07T20:32:17.9759340Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9759544Z context = 2025-05-07T20:32:17.9759548Z 2025-05-07T20:32:17.9759719Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9759992Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9760104Z module_map=module_map) 2025-05-07T20:32:17.9760270Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9760378Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9760457Z E ^ 2025-05-07T20:32:17.9760901Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9760906Z 2025-05-07T20:32:17.9761327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9761334Z 2025-05-07T20:32:17.9761439Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9761671Z self=, 2025-05-07T20:32:17.9761751Z T=4096, 2025-05-07T20:32:17.9761829Z D=5120, 2025-05-07T20:32:17.9761926Z scale_ub=1200.0, 2025-05-07T20:32:17.9762013Z contiguous=True, 2025-05-07T20:32:17.9762105Z compiled=True, 2025-05-07T20:32:17.9762180Z ) 2025-05-07T20:32:17.9762400Z self = 2025-05-07T20:32:17.9762584Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.9762589Z 2025-05-07T20:32:17.9762677Z @given( 2025-05-07T20:32:17.9762798Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9762907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9763024Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9763143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9763304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9763380Z ) 2025-05-07T20:32:17.9763637Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9763735Z def test_silu_mul_quant( 2025-05-07T20:32:17.9763814Z self, 2025-05-07T20:32:17.9763899Z T: int, 2025-05-07T20:32:17.9763978Z D: int, 2025-05-07T20:32:17.9764078Z scale_ub: Optional[float], 2025-05-07T20:32:17.9764177Z contiguous: bool, 2025-05-07T20:32:17.9764265Z compiled: bool, 2025-05-07T20:32:17.9764346Z ) -> None: 2025-05-07T20:32:17.9764452Z torch.manual_seed(2025) 2025-05-07T20:32:17.9764574Z 2025-05-07T20:32:17.9764747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9764832Z 2025-05-07T20:32:17.9764926Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9765061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9765157Z x = x_sign * x_clamp 2025-05-07T20:32:17.9765240Z x0 = x[:, :D] 2025-05-07T20:32:17.9765331Z x1 = x[:, D:] 2025-05-07T20:32:17.9765410Z 2025-05-07T20:32:17.9765496Z if contiguous: 2025-05-07T20:32:17.9765597Z x0 = x0.contiguous() 2025-05-07T20:32:17.9765689Z x1 = x1.contiguous() 2025-05-07T20:32:17.9765764Z 2025-05-07T20:32:17.9765867Z if scale_ub is not None: 2025-05-07T20:32:17.9765977Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9766117Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9766205Z ) 2025-05-07T20:32:17.9766288Z else: 2025-05-07T20:32:17.9766396Z scale_ub_tensor = None 2025-05-07T20:32:17.9766473Z 2025-05-07T20:32:17.9766606Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9766706Z op = silu_mul_quant 2025-05-07T20:32:17.9766796Z if compiled: 2025-05-07T20:32:17.9766905Z op = torch.compile(op) 2025-05-07T20:32:17.9767019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9767093Z 2025-05-07T20:32:17.9767184Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9767189Z 2025-05-07T20:32:17.9767299Z moe/activation_test.py:117: 2025-05-07T20:32:17.9767429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9767531Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9767643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9768010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9768192Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9768699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9768797Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9769167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9769399Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9769748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9769844Z kernel = self.compile( 2025-05-07T20:32:17.9770224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9770409Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9770545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9770551Z 2025-05-07T20:32:17.9770758Z self = 2025-05-07T20:32:17.9771552Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9772105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9ddd430>} 2025-05-07T20:32:17.9772869Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9773063Z context = 2025-05-07T20:32:17.9773129Z 2025-05-07T20:32:17.9773318Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9773590Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9773703Z module_map=module_map) 2025-05-07T20:32:17.9773882Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9773984Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9774065Z E ^ 2025-05-07T20:32:17.9774424Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9774429Z 2025-05-07T20:32:17.9774838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9774843Z 2025-05-07T20:32:17.9774952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9775181Z self=, 2025-05-07T20:32:17.9775263Z T=128, 2025-05-07T20:32:17.9775349Z D=5120, 2025-05-07T20:32:17.9775436Z scale_ub=1200.0, 2025-05-07T20:32:17.9775524Z contiguous=False, 2025-05-07T20:32:17.9775616Z compiled=True, 2025-05-07T20:32:17.9775689Z ) 2025-05-07T20:32:17.9775912Z self = 2025-05-07T20:32:17.9776089Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9776094Z 2025-05-07T20:32:17.9776172Z @given( 2025-05-07T20:32:17.9776300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9776398Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9776514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9776636Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9776748Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9776824Z ) 2025-05-07T20:32:17.9777154Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9777250Z def test_silu_mul_quant( 2025-05-07T20:32:17.9777333Z self, 2025-05-07T20:32:17.9777411Z T: int, 2025-05-07T20:32:17.9777485Z D: int, 2025-05-07T20:32:17.9777590Z scale_ub: Optional[float], 2025-05-07T20:32:17.9777687Z contiguous: bool, 2025-05-07T20:32:17.9777774Z compiled: bool, 2025-05-07T20:32:17.9777851Z ) -> None: 2025-05-07T20:32:17.9777950Z torch.manual_seed(2025) 2025-05-07T20:32:17.9778025Z 2025-05-07T20:32:17.9778195Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9778276Z 2025-05-07T20:32:17.9778366Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9778495Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9778589Z x = x_sign * x_clamp 2025-05-07T20:32:17.9778670Z x0 = x[:, :D] 2025-05-07T20:32:17.9778756Z x1 = x[:, D:] 2025-05-07T20:32:17.9778837Z 2025-05-07T20:32:17.9778922Z if contiguous: 2025-05-07T20:32:17.9779020Z x0 = x0.contiguous() 2025-05-07T20:32:17.9779109Z x1 = x1.contiguous() 2025-05-07T20:32:17.9779183Z 2025-05-07T20:32:17.9779280Z if scale_ub is not None: 2025-05-07T20:32:17.9779428Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9779566Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9779651Z ) 2025-05-07T20:32:17.9779729Z else: 2025-05-07T20:32:17.9779822Z scale_ub_tensor = None 2025-05-07T20:32:17.9779901Z 2025-05-07T20:32:17.9780031Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9780125Z op = silu_mul_quant 2025-05-07T20:32:17.9780210Z if compiled: 2025-05-07T20:32:17.9780310Z op = torch.compile(op) 2025-05-07T20:32:17.9780423Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9780538Z 2025-05-07T20:32:17.9780637Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9780642Z 2025-05-07T20:32:17.9780744Z moe/activation_test.py:117: 2025-05-07T20:32:17.9780872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9780973Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9781164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9781535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9781633Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9782125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9782222Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9782583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9782813Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9783164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9783259Z kernel = self.compile( 2025-05-07T20:32:17.9783638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9783823Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9783951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9783955Z 2025-05-07T20:32:17.9784160Z self = 2025-05-07T20:32:17.9784942Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9785526Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e982b040>} 2025-05-07T20:32:17.9786283Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9786477Z context = 2025-05-07T20:32:17.9786481Z 2025-05-07T20:32:17.9786650Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9786911Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9787020Z module_map=module_map) 2025-05-07T20:32:17.9787186Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9787284Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9787370Z E ^ 2025-05-07T20:32:17.9787732Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9787737Z 2025-05-07T20:32:17.9788147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9788189Z 2025-05-07T20:32:17.9788300Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9788522Z self=, 2025-05-07T20:32:17.9788600Z T=16384, 2025-05-07T20:32:17.9788681Z D=7168, 2025-05-07T20:32:17.9788763Z scale_ub=1200.0, 2025-05-07T20:32:17.9788847Z contiguous=True, 2025-05-07T20:32:17.9788936Z compiled=True, 2025-05-07T20:32:17.9789009Z ) 2025-05-07T20:32:17.9789226Z self = 2025-05-07T20:32:17.9789406Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.9789456Z 2025-05-07T20:32:17.9789534Z @given( 2025-05-07T20:32:17.9789659Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9789758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9789873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9789996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9790109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9790183Z ) 2025-05-07T20:32:17.9790435Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9790532Z def test_silu_mul_quant( 2025-05-07T20:32:17.9790613Z self, 2025-05-07T20:32:17.9790690Z T: int, 2025-05-07T20:32:17.9790766Z D: int, 2025-05-07T20:32:17.9790869Z scale_ub: Optional[float], 2025-05-07T20:32:17.9790957Z contiguous: bool, 2025-05-07T20:32:17.9791044Z compiled: bool, 2025-05-07T20:32:17.9791125Z ) -> None: 2025-05-07T20:32:17.9791230Z torch.manual_seed(2025) 2025-05-07T20:32:17.9791304Z 2025-05-07T20:32:17.9791486Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9791561Z 2025-05-07T20:32:17.9791653Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9791782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9791875Z x = x_sign * x_clamp 2025-05-07T20:32:17.9791961Z x0 = x[:, :D] 2025-05-07T20:32:17.9792043Z x1 = x[:, D:] 2025-05-07T20:32:17.9792116Z 2025-05-07T20:32:17.9792206Z if contiguous: 2025-05-07T20:32:17.9792298Z x0 = x0.contiguous() 2025-05-07T20:32:17.9792386Z x1 = x1.contiguous() 2025-05-07T20:32:17.9792464Z 2025-05-07T20:32:17.9792556Z if scale_ub is not None: 2025-05-07T20:32:17.9792666Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9792810Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9792968Z ) 2025-05-07T20:32:17.9793046Z else: 2025-05-07T20:32:17.9793147Z scale_ub_tensor = None 2025-05-07T20:32:17.9793219Z 2025-05-07T20:32:17.9793350Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9793448Z op = silu_mul_quant 2025-05-07T20:32:17.9793538Z if compiled: 2025-05-07T20:32:17.9793645Z op = torch.compile(op) 2025-05-07T20:32:17.9793749Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9793820Z 2025-05-07T20:32:17.9793915Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9793920Z 2025-05-07T20:32:17.9794016Z moe/activation_test.py:117: 2025-05-07T20:32:17.9794144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9794251Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9794350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9794728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9794831Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9795324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9795423Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9795826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9796051Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9796395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9796488Z kernel = self.compile( 2025-05-07T20:32:17.9796871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9797045Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9797220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9797225Z 2025-05-07T20:32:17.9797435Z self = 2025-05-07T20:32:17.9798205Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9798722Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e982bb80>} 2025-05-07T20:32:17.9799460Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9799655Z context = 2025-05-07T20:32:17.9799662Z 2025-05-07T20:32:17.9799835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9800103Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9800223Z module_map=module_map) 2025-05-07T20:32:17.9800384Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9800486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9800573Z E ^ 2025-05-07T20:32:17.9800926Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9800931Z 2025-05-07T20:32:17.9801347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9801351Z 2025-05-07T20:32:17.9801456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9801752Z self=, 2025-05-07T20:32:17.9801837Z T=16384, 2025-05-07T20:32:17.9801914Z D=5120, 2025-05-07T20:32:17.9801996Z scale_ub=1200.0, 2025-05-07T20:32:17.9802089Z contiguous=True, 2025-05-07T20:32:17.9802173Z compiled=False, 2025-05-07T20:32:17.9802249Z ) 2025-05-07T20:32:17.9802470Z self = 2025-05-07T20:32:17.9802655Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9802659Z 2025-05-07T20:32:17.9802740Z @given( 2025-05-07T20:32:17.9802863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9802964Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9803079Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9803204Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9803321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9803409Z ) 2025-05-07T20:32:17.9803657Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9803754Z def test_silu_mul_quant( 2025-05-07T20:32:17.9803836Z self, 2025-05-07T20:32:17.9803913Z T: int, 2025-05-07T20:32:17.9803989Z D: int, 2025-05-07T20:32:17.9804159Z scale_ub: Optional[float], 2025-05-07T20:32:17.9804251Z contiguous: bool, 2025-05-07T20:32:17.9804340Z compiled: bool, 2025-05-07T20:32:17.9804423Z ) -> None: 2025-05-07T20:32:17.9804518Z torch.manual_seed(2025) 2025-05-07T20:32:17.9804591Z 2025-05-07T20:32:17.9804765Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9804842Z 2025-05-07T20:32:17.9804937Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9805063Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9805151Z x = x_sign * x_clamp 2025-05-07T20:32:17.9805238Z x0 = x[:, :D] 2025-05-07T20:32:17.9805367Z x1 = x[:, D:] 2025-05-07T20:32:17.9805439Z 2025-05-07T20:32:17.9805528Z if contiguous: 2025-05-07T20:32:17.9805621Z x0 = x0.contiguous() 2025-05-07T20:32:17.9805714Z x1 = x1.contiguous() 2025-05-07T20:32:17.9805795Z 2025-05-07T20:32:17.9805890Z if scale_ub is not None: 2025-05-07T20:32:17.9805995Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9806135Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9806211Z ) 2025-05-07T20:32:17.9806288Z else: 2025-05-07T20:32:17.9806387Z scale_ub_tensor = None 2025-05-07T20:32:17.9806460Z 2025-05-07T20:32:17.9806593Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9806683Z op = silu_mul_quant 2025-05-07T20:32:17.9806767Z if compiled: 2025-05-07T20:32:17.9806870Z op = torch.compile(op) 2025-05-07T20:32:17.9806979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9807055Z 2025-05-07T20:32:17.9807152Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9807157Z 2025-05-07T20:32:17.9807256Z moe/activation_test.py:117: 2025-05-07T20:32:17.9807382Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9807492Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9807591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9808102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9808198Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9808557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9808785Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9809200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9809307Z kernel = self.compile( 2025-05-07T20:32:17.9809687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9809861Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9809993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9809998Z 2025-05-07T20:32:17.9810202Z self = 2025-05-07T20:32:17.9810973Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9811486Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e97835e0>} 2025-05-07T20:32:17.9812240Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9812435Z context = 2025-05-07T20:32:17.9812479Z 2025-05-07T20:32:17.9812648Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9812913Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9813020Z module_map=module_map) 2025-05-07T20:32:17.9813181Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9813284Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9813362Z E ^ 2025-05-07T20:32:17.9813720Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9813765Z 2025-05-07T20:32:17.9814190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9814195Z 2025-05-07T20:32:17.9814298Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9814525Z self=, 2025-05-07T20:32:17.9814601Z T=1, 2025-05-07T20:32:17.9814676Z D=7168, 2025-05-07T20:32:17.9814762Z scale_ub=1200.0, 2025-05-07T20:32:17.9814848Z contiguous=False, 2025-05-07T20:32:17.9814931Z compiled=False, 2025-05-07T20:32:17.9815011Z ) 2025-05-07T20:32:17.9815227Z self = 2025-05-07T20:32:17.9815395Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.9815404Z 2025-05-07T20:32:17.9815479Z @given( 2025-05-07T20:32:17.9815601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9815708Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9815822Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9815939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9816054Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9816130Z ) 2025-05-07T20:32:17.9816374Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9816472Z def test_silu_mul_quant( 2025-05-07T20:32:17.9816551Z self, 2025-05-07T20:32:17.9816632Z T: int, 2025-05-07T20:32:17.9816708Z D: int, 2025-05-07T20:32:17.9816808Z scale_ub: Optional[float], 2025-05-07T20:32:17.9816903Z contiguous: bool, 2025-05-07T20:32:17.9816988Z compiled: bool, 2025-05-07T20:32:17.9817066Z ) -> None: 2025-05-07T20:32:17.9817165Z torch.manual_seed(2025) 2025-05-07T20:32:17.9817237Z 2025-05-07T20:32:17.9817493Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9817575Z 2025-05-07T20:32:17.9817667Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9817792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9817888Z x = x_sign * x_clamp 2025-05-07T20:32:17.9817969Z x0 = x[:, :D] 2025-05-07T20:32:17.9818052Z x1 = x[:, D:] 2025-05-07T20:32:17.9818127Z 2025-05-07T20:32:17.9818211Z if contiguous: 2025-05-07T20:32:17.9818307Z x0 = x0.contiguous() 2025-05-07T20:32:17.9818397Z x1 = x1.contiguous() 2025-05-07T20:32:17.9818471Z 2025-05-07T20:32:17.9818565Z if scale_ub is not None: 2025-05-07T20:32:17.9818669Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9818807Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9818888Z ) 2025-05-07T20:32:17.9818965Z else: 2025-05-07T20:32:17.9819058Z scale_ub_tensor = None 2025-05-07T20:32:17.9819146Z 2025-05-07T20:32:17.9819276Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9819367Z op = silu_mul_quant 2025-05-07T20:32:17.9819455Z if compiled: 2025-05-07T20:32:17.9819554Z op = torch.compile(op) 2025-05-07T20:32:17.9819663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9819780Z 2025-05-07T20:32:17.9819872Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9819877Z 2025-05-07T20:32:17.9819977Z moe/activation_test.py:117: 2025-05-07T20:32:17.9820105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9820206Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9820311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9820807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9820902Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9821373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9821601Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9821945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9822042Z kernel = self.compile( 2025-05-07T20:32:17.9822422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9822601Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9822724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9822729Z 2025-05-07T20:32:17.9822941Z self = 2025-05-07T20:32:17.9823712Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9824225Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e97839d0>} 2025-05-07T20:32:17.9824982Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9825174Z context = 2025-05-07T20:32:17.9825179Z 2025-05-07T20:32:17.9825350Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9825611Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9825804Z module_map=module_map) 2025-05-07T20:32:17.9825976Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9826075Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9826158Z E ^ 2025-05-07T20:32:17.9826508Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9826515Z 2025-05-07T20:32:17.9826924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9826928Z 2025-05-07T20:32:17.9827036Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9827256Z self=, 2025-05-07T20:32:17.9827337Z T=4096, 2025-05-07T20:32:17.9827413Z D=7168, 2025-05-07T20:32:17.9827496Z scale_ub=1200.0, 2025-05-07T20:32:17.9827586Z contiguous=False, 2025-05-07T20:32:17.9827669Z compiled=True, 2025-05-07T20:32:17.9827745Z ) 2025-05-07T20:32:17.9827971Z self = 2025-05-07T20:32:17.9828146Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9828151Z 2025-05-07T20:32:17.9828228Z @given( 2025-05-07T20:32:17.9828353Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9828493Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9828609Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9828725Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9828837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9828914Z ) 2025-05-07T20:32:17.9829159Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9829253Z def test_silu_mul_quant( 2025-05-07T20:32:17.9829333Z self, 2025-05-07T20:32:17.9829408Z T: int, 2025-05-07T20:32:17.9829485Z D: int, 2025-05-07T20:32:17.9829634Z scale_ub: Optional[float], 2025-05-07T20:32:17.9829723Z contiguous: bool, 2025-05-07T20:32:17.9829809Z compiled: bool, 2025-05-07T20:32:17.9829891Z ) -> None: 2025-05-07T20:32:17.9829985Z torch.manual_seed(2025) 2025-05-07T20:32:17.9830063Z 2025-05-07T20:32:17.9830233Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9830307Z 2025-05-07T20:32:17.9830404Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9830530Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9830619Z x = x_sign * x_clamp 2025-05-07T20:32:17.9830704Z x0 = x[:, :D] 2025-05-07T20:32:17.9830784Z x1 = x[:, D:] 2025-05-07T20:32:17.9830856Z 2025-05-07T20:32:17.9830943Z if contiguous: 2025-05-07T20:32:17.9831033Z x0 = x0.contiguous() 2025-05-07T20:32:17.9831121Z x1 = x1.contiguous() 2025-05-07T20:32:17.9831195Z 2025-05-07T20:32:17.9831290Z if scale_ub is not None: 2025-05-07T20:32:17.9831397Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9831540Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9831616Z ) 2025-05-07T20:32:17.9831698Z else: 2025-05-07T20:32:17.9831795Z scale_ub_tensor = None 2025-05-07T20:32:17.9831870Z 2025-05-07T20:32:17.9832004Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9832092Z op = silu_mul_quant 2025-05-07T20:32:17.9832176Z if compiled: 2025-05-07T20:32:17.9832280Z op = torch.compile(op) 2025-05-07T20:32:17.9832385Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9832457Z 2025-05-07T20:32:17.9832550Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9832554Z 2025-05-07T20:32:17.9832650Z moe/activation_test.py:117: 2025-05-07T20:32:17.9832781Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9832984Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9833087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9833462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9833555Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9834049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9834146Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9834501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9834728Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9835064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9835158Z kernel = self.compile( 2025-05-07T20:32:17.9835556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9835733Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9835857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9835905Z 2025-05-07T20:32:17.9836114Z self = 2025-05-07T20:32:17.9836882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9837386Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9696c10>} 2025-05-07T20:32:17.9838130Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9838363Z context = 2025-05-07T20:32:17.9838368Z 2025-05-07T20:32:17.9838530Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9838793Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9838904Z module_map=module_map) 2025-05-07T20:32:17.9839067Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9839164Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9839244Z E ^ 2025-05-07T20:32:17.9839596Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9839601Z 2025-05-07T20:32:17.9840026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9840032Z 2025-05-07T20:32:17.9840574Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9840802Z self=, 2025-05-07T20:32:17.9840881Z T=128, 2025-05-07T20:32:17.9840962Z D=7168, 2025-05-07T20:32:17.9841049Z scale_ub=1200.0, 2025-05-07T20:32:17.9841133Z contiguous=False, 2025-05-07T20:32:17.9841215Z compiled=True, 2025-05-07T20:32:17.9841289Z ) 2025-05-07T20:32:17.9841504Z self = 2025-05-07T20:32:17.9841674Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.9841679Z 2025-05-07T20:32:17.9841760Z @given( 2025-05-07T20:32:17.9841876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9841973Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9842294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9842420Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9842539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9842614Z ) 2025-05-07T20:32:17.9842860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9842965Z def test_silu_mul_quant( 2025-05-07T20:32:17.9843038Z self, 2025-05-07T20:32:17.9843112Z T: int, 2025-05-07T20:32:17.9843193Z D: int, 2025-05-07T20:32:17.9843291Z scale_ub: Optional[float], 2025-05-07T20:32:17.9843380Z contiguous: bool, 2025-05-07T20:32:17.9843470Z compiled: bool, 2025-05-07T20:32:17.9843547Z ) -> None: 2025-05-07T20:32:17.9843640Z torch.manual_seed(2025) 2025-05-07T20:32:17.9843717Z 2025-05-07T20:32:17.9843883Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9843962Z 2025-05-07T20:32:17.9844054Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9844189Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9844279Z x = x_sign * x_clamp 2025-05-07T20:32:17.9844360Z x0 = x[:, :D] 2025-05-07T20:32:17.9844438Z x1 = x[:, D:] 2025-05-07T20:32:17.9844514Z 2025-05-07T20:32:17.9844596Z if contiguous: 2025-05-07T20:32:17.9844748Z x0 = x0.contiguous() 2025-05-07T20:32:17.9844846Z x1 = x1.contiguous() 2025-05-07T20:32:17.9844919Z 2025-05-07T20:32:17.9845008Z if scale_ub is not None: 2025-05-07T20:32:17.9845118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9845253Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9845328Z ) 2025-05-07T20:32:17.9845409Z else: 2025-05-07T20:32:17.9845503Z scale_ub_tensor = None 2025-05-07T20:32:17.9845578Z 2025-05-07T20:32:17.9845708Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9845864Z op = silu_mul_quant 2025-05-07T20:32:17.9845953Z if compiled: 2025-05-07T20:32:17.9846052Z op = torch.compile(op) 2025-05-07T20:32:17.9846155Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9846229Z 2025-05-07T20:32:17.9846319Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9846326Z 2025-05-07T20:32:17.9846424Z moe/activation_test.py:117: 2025-05-07T20:32:17.9846553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9846652Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9846755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9847125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9847219Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9847714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9847815Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9848168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9848394Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9848738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9848837Z kernel = self.compile( 2025-05-07T20:32:17.9849214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9849390Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9849519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9849523Z 2025-05-07T20:32:17.9849729Z self = 2025-05-07T20:32:17.9850576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9851091Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e98a9820>} 2025-05-07T20:32:17.9851842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9852035Z context = 2025-05-07T20:32:17.9852039Z 2025-05-07T20:32:17.9852202Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9852466Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9852582Z module_map=module_map) 2025-05-07T20:32:17.9852741Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9852844Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9852921Z E ^ 2025-05-07T20:32:17.9853272Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9853323Z 2025-05-07T20:32:17.9853740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9853744Z 2025-05-07T20:32:17.9853847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9854074Z self=, 2025-05-07T20:32:17.9854149Z T=2048, 2025-05-07T20:32:17.9854222Z D=7168, 2025-05-07T20:32:17.9854308Z scale_ub=None, 2025-05-07T20:32:17.9854391Z contiguous=True, 2025-05-07T20:32:17.9854521Z compiled=True, 2025-05-07T20:32:17.9854598Z ) 2025-05-07T20:32:17.9854816Z self = 2025-05-07T20:32:17.9854987Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.9854992Z 2025-05-07T20:32:17.9855068Z @given( 2025-05-07T20:32:17.9855187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9855292Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9855404Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9855519Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9855637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9855709Z ) 2025-05-07T20:32:17.9855957Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9856048Z def test_silu_mul_quant( 2025-05-07T20:32:17.9856124Z self, 2025-05-07T20:32:17.9856205Z T: int, 2025-05-07T20:32:17.9856287Z D: int, 2025-05-07T20:32:17.9856384Z scale_ub: Optional[float], 2025-05-07T20:32:17.9856475Z contiguous: bool, 2025-05-07T20:32:17.9856559Z compiled: bool, 2025-05-07T20:32:17.9856636Z ) -> None: 2025-05-07T20:32:17.9856734Z torch.manual_seed(2025) 2025-05-07T20:32:17.9856808Z 2025-05-07T20:32:17.9856977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9857053Z 2025-05-07T20:32:17.9857142Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9857269Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9857360Z x = x_sign * x_clamp 2025-05-07T20:32:17.9857443Z x0 = x[:, :D] 2025-05-07T20:32:17.9857525Z x1 = x[:, D:] 2025-05-07T20:32:17.9857595Z 2025-05-07T20:32:17.9857676Z if contiguous: 2025-05-07T20:32:17.9857768Z x0 = x0.contiguous() 2025-05-07T20:32:17.9857855Z x1 = x1.contiguous() 2025-05-07T20:32:17.9857931Z 2025-05-07T20:32:17.9858105Z if scale_ub is not None: 2025-05-07T20:32:17.9858211Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9858347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9858427Z ) 2025-05-07T20:32:17.9858502Z else: 2025-05-07T20:32:17.9858598Z scale_ub_tensor = None 2025-05-07T20:32:17.9858674Z 2025-05-07T20:32:17.9858802Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9858894Z op = silu_mul_quant 2025-05-07T20:32:17.9858976Z if compiled: 2025-05-07T20:32:17.9859075Z op = torch.compile(op) 2025-05-07T20:32:17.9859180Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9859250Z 2025-05-07T20:32:17.9859339Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9859343Z 2025-05-07T20:32:17.9859442Z moe/activation_test.py:117: 2025-05-07T20:32:17.9859573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9859674Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9859776Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9863750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.9863936Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.9864451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9864551Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9864915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9865140Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9865475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9865575Z kernel = self.compile( 2025-05-07T20:32:17.9866085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9866279Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9866418Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9866431Z 2025-05-07T20:32:17.9866665Z self = 2025-05-07T20:32:17.9867645Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9868273Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e97b54c0>} 2025-05-07T20:32:17.9869207Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9869432Z context = 2025-05-07T20:32:17.9869436Z 2025-05-07T20:32:17.9869620Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9869930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9870046Z module_map=module_map) 2025-05-07T20:32:17.9870222Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9870328Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9870404Z E ^ 2025-05-07T20:32:17.9870831Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9870835Z 2025-05-07T20:32:17.9871445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9871451Z 2025-05-07T20:32:17.9871561Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9871783Z self=, 2025-05-07T20:32:17.9871867Z T=16384, 2025-05-07T20:32:17.9871941Z D=5120, 2025-05-07T20:32:17.9872027Z scale_ub=None, 2025-05-07T20:32:17.9872114Z contiguous=False, 2025-05-07T20:32:17.9872196Z compiled=False, 2025-05-07T20:32:17.9872270Z ) 2025-05-07T20:32:17.9872487Z self = 2025-05-07T20:32:17.9872666Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.9872671Z 2025-05-07T20:32:17.9872749Z @given( 2025-05-07T20:32:17.9872866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9872963Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9873093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9873207Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9873327Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9873398Z ) 2025-05-07T20:32:17.9873643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9873779Z def test_silu_mul_quant( 2025-05-07T20:32:17.9873854Z self, 2025-05-07T20:32:17.9873930Z T: int, 2025-05-07T20:32:17.9874007Z D: int, 2025-05-07T20:32:17.9874106Z scale_ub: Optional[float], 2025-05-07T20:32:17.9874192Z contiguous: bool, 2025-05-07T20:32:17.9874281Z compiled: bool, 2025-05-07T20:32:17.9874361Z ) -> None: 2025-05-07T20:32:17.9874454Z torch.manual_seed(2025) 2025-05-07T20:32:17.9874532Z 2025-05-07T20:32:17.9874701Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9874818Z 2025-05-07T20:32:17.9874912Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9875038Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9876851Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9876860Z 2025-05-07T20:32:17.9876978Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:17.9876983Z 2025-05-07T20:32:17.9877089Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9877312Z self=, 2025-05-07T20:32:17.9877393Z T=4096, 2025-05-07T20:32:17.9877471Z D=7168, 2025-05-07T20:32:17.9877551Z scale_ub=1200.0, 2025-05-07T20:32:17.9877634Z contiguous=True, 2025-05-07T20:32:17.9877719Z compiled=True, 2025-05-07T20:32:17.9877790Z ) 2025-05-07T20:32:17.9878010Z self = 2025-05-07T20:32:17.9878180Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.9878185Z 2025-05-07T20:32:17.9878259Z @given( 2025-05-07T20:32:17.9878380Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9878476Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9878589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9878705Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9878814Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9878885Z ) 2025-05-07T20:32:17.9879217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9879310Z def test_silu_mul_quant( 2025-05-07T20:32:17.9879387Z self, 2025-05-07T20:32:17.9879461Z T: int, 2025-05-07T20:32:17.9879537Z D: int, 2025-05-07T20:32:17.9879635Z scale_ub: Optional[float], 2025-05-07T20:32:17.9879724Z contiguous: bool, 2025-05-07T20:32:17.9879808Z compiled: bool, 2025-05-07T20:32:17.9879888Z ) -> None: 2025-05-07T20:32:17.9879980Z torch.manual_seed(2025) 2025-05-07T20:32:17.9880054Z 2025-05-07T20:32:17.9880225Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9880297Z 2025-05-07T20:32:17.9880385Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9880511Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9882271Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9882326Z 2025-05-07T20:32:17.9882445Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:17.9882449Z 2025-05-07T20:32:17.9882549Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9882774Z self=, 2025-05-07T20:32:17.9882850Z T=16384, 2025-05-07T20:32:17.9882926Z D=7168, 2025-05-07T20:32:17.9883012Z scale_ub=None, 2025-05-07T20:32:17.9883096Z contiguous=False, 2025-05-07T20:32:17.9883178Z compiled=False, 2025-05-07T20:32:17.9883294Z ) 2025-05-07T20:32:17.9883513Z self = 2025-05-07T20:32:17.9883691Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.9883701Z 2025-05-07T20:32:17.9883778Z @given( 2025-05-07T20:32:17.9883892Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9883994Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9884104Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9884220Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9884331Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9884402Z ) 2025-05-07T20:32:17.9884643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9884741Z def test_silu_mul_quant( 2025-05-07T20:32:17.9884816Z self, 2025-05-07T20:32:17.9884894Z T: int, 2025-05-07T20:32:17.9884968Z D: int, 2025-05-07T20:32:17.9885071Z scale_ub: Optional[float], 2025-05-07T20:32:17.9885161Z contiguous: bool, 2025-05-07T20:32:17.9885247Z compiled: bool, 2025-05-07T20:32:17.9885323Z ) -> None: 2025-05-07T20:32:17.9885420Z torch.manual_seed(2025) 2025-05-07T20:32:17.9885490Z 2025-05-07T20:32:17.9885661Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9887424Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9887509Z 2025-05-07T20:32:17.9887627Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9887631Z 2025-05-07T20:32:17.9887734Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9887952Z self=, 2025-05-07T20:32:17.9888032Z T=2048, 2025-05-07T20:32:17.9888107Z D=7168, 2025-05-07T20:32:17.9888190Z scale_ub=1200.0, 2025-05-07T20:32:17.9888278Z contiguous=True, 2025-05-07T20:32:17.9888358Z compiled=True, 2025-05-07T20:32:17.9888431Z ) 2025-05-07T20:32:17.9888653Z self = 2025-05-07T20:32:17.9888823Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.9888827Z 2025-05-07T20:32:17.9888902Z @given( 2025-05-07T20:32:17.9889022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9889120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9889242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9889361Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9889470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9889545Z ) 2025-05-07T20:32:17.9892215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9892382Z def test_silu_mul_quant( 2025-05-07T20:32:17.9892459Z self, 2025-05-07T20:32:17.9892540Z T: int, 2025-05-07T20:32:17.9892617Z D: int, 2025-05-07T20:32:17.9892722Z scale_ub: Optional[float], 2025-05-07T20:32:17.9892814Z contiguous: bool, 2025-05-07T20:32:17.9892901Z compiled: bool, 2025-05-07T20:32:17.9893009Z ) -> None: 2025-05-07T20:32:17.9893112Z torch.manual_seed(2025) 2025-05-07T20:32:17.9893205Z 2025-05-07T20:32:17.9893379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9893451Z 2025-05-07T20:32:17.9893591Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9893718Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9895490Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9895498Z 2025-05-07T20:32:17.9895617Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:17.9895621Z 2025-05-07T20:32:17.9895727Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9895947Z self=, 2025-05-07T20:32:17.9896028Z T=2048, 2025-05-07T20:32:17.9896107Z D=7168, 2025-05-07T20:32:17.9896188Z scale_ub=None, 2025-05-07T20:32:17.9896269Z contiguous=True, 2025-05-07T20:32:17.9896358Z compiled=False, 2025-05-07T20:32:17.9896429Z ) 2025-05-07T20:32:17.9896645Z self = 2025-05-07T20:32:17.9896827Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.9896832Z 2025-05-07T20:32:17.9896906Z @given( 2025-05-07T20:32:17.9897022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9897120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9897231Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9897349Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9897459Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9897531Z ) 2025-05-07T20:32:17.9897823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9897919Z def test_silu_mul_quant( 2025-05-07T20:32:17.9897994Z self, 2025-05-07T20:32:17.9898072Z T: int, 2025-05-07T20:32:17.9898145Z D: int, 2025-05-07T20:32:17.9898240Z scale_ub: Optional[float], 2025-05-07T20:32:17.9898336Z contiguous: bool, 2025-05-07T20:32:17.9898420Z compiled: bool, 2025-05-07T20:32:17.9898499Z ) -> None: 2025-05-07T20:32:17.9898593Z torch.manual_seed(2025) 2025-05-07T20:32:17.9898666Z 2025-05-07T20:32:17.9898834Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9898905Z 2025-05-07T20:32:17.9898995Z > x_sign = torch.sign(x) 2025-05-07T20:32:17.9900746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9900894Z 2025-05-07T20:32:17.9901017Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:17.9901022Z 2025-05-07T20:32:17.9901205Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9901427Z self=, 2025-05-07T20:32:17.9901506Z T=1, 2025-05-07T20:32:17.9901587Z D=7168, 2025-05-07T20:32:17.9901670Z scale_ub=1200.0, 2025-05-07T20:32:17.9901755Z contiguous=True, 2025-05-07T20:32:17.9901842Z compiled=False, 2025-05-07T20:32:17.9901914Z ) 2025-05-07T20:32:17.9902131Z self = 2025-05-07T20:32:17.9902348Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9902353Z 2025-05-07T20:32:17.9902427Z @given( 2025-05-07T20:32:17.9902546Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9902642Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9902758Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9902877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9902989Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9903066Z ) 2025-05-07T20:32:17.9903312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9903404Z def test_silu_mul_quant( 2025-05-07T20:32:17.9903483Z self, 2025-05-07T20:32:17.9903557Z T: int, 2025-05-07T20:32:17.9903632Z D: int, 2025-05-07T20:32:17.9903730Z scale_ub: Optional[float], 2025-05-07T20:32:17.9903817Z contiguous: bool, 2025-05-07T20:32:17.9903906Z compiled: bool, 2025-05-07T20:32:17.9903985Z ) -> None: 2025-05-07T20:32:17.9904080Z torch.manual_seed(2025) 2025-05-07T20:32:17.9904151Z 2025-05-07T20:32:17.9904319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9904394Z 2025-05-07T20:32:17.9904491Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9904614Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9904700Z x = x_sign * x_clamp 2025-05-07T20:32:17.9904782Z x0 = x[:, :D] 2025-05-07T20:32:17.9904860Z x1 = x[:, D:] 2025-05-07T20:32:17.9904931Z 2025-05-07T20:32:17.9905017Z if contiguous: 2025-05-07T20:32:17.9905107Z x0 = x0.contiguous() 2025-05-07T20:32:17.9905198Z x1 = x1.contiguous() 2025-05-07T20:32:17.9905269Z 2025-05-07T20:32:17.9905357Z if scale_ub is not None: 2025-05-07T20:32:17.9905464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9905646Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9905721Z ) 2025-05-07T20:32:17.9905800Z else: 2025-05-07T20:32:17.9905892Z scale_ub_tensor = None 2025-05-07T20:32:17.9905963Z 2025-05-07T20:32:17.9906095Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9906188Z op = silu_mul_quant 2025-05-07T20:32:17.9906273Z if compiled: 2025-05-07T20:32:17.9906377Z op = torch.compile(op) 2025-05-07T20:32:17.9906479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9906556Z 2025-05-07T20:32:17.9906646Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9906650Z 2025-05-07T20:32:17.9906745Z moe/activation_test.py:117: 2025-05-07T20:32:17.9906874Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9906973Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9907074Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9907582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9907675Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9908094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9908361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9908700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9908796Z kernel = self.compile( 2025-05-07T20:32:17.9909176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9909350Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9909477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9909527Z 2025-05-07T20:32:17.9909734Z self = 2025-05-07T20:32:17.9910514Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9911016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e93ec040>} 2025-05-07T20:32:17.9911758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9911947Z context = 2025-05-07T20:32:17.9911952Z 2025-05-07T20:32:17.9912119Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9912392Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9912499Z module_map=module_map) 2025-05-07T20:32:17.9912659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9912766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9912841Z E ^ 2025-05-07T20:32:17.9913196Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9913201Z 2025-05-07T20:32:17.9913611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9913616Z 2025-05-07T20:32:17.9913716Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9913939Z self=, 2025-05-07T20:32:17.9914020Z T=128, 2025-05-07T20:32:17.9914141Z D=5120, 2025-05-07T20:32:17.9914221Z scale_ub=None, 2025-05-07T20:32:17.9914304Z contiguous=True, 2025-05-07T20:32:17.9914392Z compiled=False, 2025-05-07T20:32:17.9914465Z ) 2025-05-07T20:32:17.9914681Z self = 2025-05-07T20:32:17.9914856Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.9914861Z 2025-05-07T20:32:17.9914936Z @given( 2025-05-07T20:32:17.9915053Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9915151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9915263Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9915381Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9915492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9915563Z ) 2025-05-07T20:32:17.9915813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9915909Z def test_silu_mul_quant( 2025-05-07T20:32:17.9915983Z self, 2025-05-07T20:32:17.9916062Z T: int, 2025-05-07T20:32:17.9916136Z D: int, 2025-05-07T20:32:17.9916233Z scale_ub: Optional[float], 2025-05-07T20:32:17.9916323Z contiguous: bool, 2025-05-07T20:32:17.9916492Z compiled: bool, 2025-05-07T20:32:17.9916570Z ) -> None: 2025-05-07T20:32:17.9916669Z torch.manual_seed(2025) 2025-05-07T20:32:17.9916743Z 2025-05-07T20:32:17.9916910Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9916985Z 2025-05-07T20:32:17.9917074Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9917198Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9917284Z x = x_sign * x_clamp 2025-05-07T20:32:17.9917362Z x0 = x[:, :D] 2025-05-07T20:32:17.9917446Z x1 = x[:, D:] 2025-05-07T20:32:17.9917518Z 2025-05-07T20:32:17.9917650Z if contiguous: 2025-05-07T20:32:17.9917745Z x0 = x0.contiguous() 2025-05-07T20:32:17.9917833Z x1 = x1.contiguous() 2025-05-07T20:32:17.9917904Z 2025-05-07T20:32:17.9917999Z if scale_ub is not None: 2025-05-07T20:32:17.9918104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9918242Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9918322Z ) 2025-05-07T20:32:17.9918398Z else: 2025-05-07T20:32:17.9918496Z scale_ub_tensor = None 2025-05-07T20:32:17.9918567Z 2025-05-07T20:32:17.9918695Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9918785Z op = silu_mul_quant 2025-05-07T20:32:17.9918869Z if compiled: 2025-05-07T20:32:17.9918969Z op = torch.compile(op) 2025-05-07T20:32:17.9919075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9919145Z 2025-05-07T20:32:17.9919234Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9919243Z 2025-05-07T20:32:17.9919346Z moe/activation_test.py:117: 2025-05-07T20:32:17.9919474Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9919575Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9919673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9920176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9920277Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9920639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9920859Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9921199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9921290Z kernel = self.compile( 2025-05-07T20:32:17.9921723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9921898Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9922022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9922031Z 2025-05-07T20:32:17.9922239Z self = 2025-05-07T20:32:17.9923045Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9923560Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e93ec9d0>} 2025-05-07T20:32:17.9924300Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9924493Z context = 2025-05-07T20:32:17.9924500Z 2025-05-07T20:32:17.9924710Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9925007Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9925116Z module_map=module_map) 2025-05-07T20:32:17.9925277Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9925373Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9925453Z E ^ 2025-05-07T20:32:17.9925806Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9925811Z 2025-05-07T20:32:17.9926228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9926272Z 2025-05-07T20:32:17.9926379Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9926601Z self=, 2025-05-07T20:32:17.9926683Z T=128, 2025-05-07T20:32:17.9926760Z D=7168, 2025-05-07T20:32:17.9926841Z scale_ub=None, 2025-05-07T20:32:17.9926928Z contiguous=True, 2025-05-07T20:32:17.9927012Z compiled=False, 2025-05-07T20:32:17.9927083Z ) 2025-05-07T20:32:17.9927303Z self = 2025-05-07T20:32:17.9927470Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.9927474Z 2025-05-07T20:32:17.9927554Z @given( 2025-05-07T20:32:17.9927670Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9927766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9927888Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9928005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9928119Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9928198Z ) 2025-05-07T20:32:17.9928441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9928538Z def test_silu_mul_quant( 2025-05-07T20:32:17.9928615Z self, 2025-05-07T20:32:17.9928696Z T: int, 2025-05-07T20:32:17.9928770Z D: int, 2025-05-07T20:32:17.9928868Z scale_ub: Optional[float], 2025-05-07T20:32:17.9928955Z contiguous: bool, 2025-05-07T20:32:17.9929041Z compiled: bool, 2025-05-07T20:32:17.9929121Z ) -> None: 2025-05-07T20:32:17.9929214Z torch.manual_seed(2025) 2025-05-07T20:32:17.9929284Z 2025-05-07T20:32:17.9929454Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9929525Z 2025-05-07T20:32:17.9929665Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9929791Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9929878Z x = x_sign * x_clamp 2025-05-07T20:32:17.9929961Z x0 = x[:, :D] 2025-05-07T20:32:17.9930038Z x1 = x[:, D:] 2025-05-07T20:32:17.9930108Z 2025-05-07T20:32:17.9930196Z if contiguous: 2025-05-07T20:32:17.9930291Z x0 = x0.contiguous() 2025-05-07T20:32:17.9930377Z x1 = x1.contiguous() 2025-05-07T20:32:17.9930452Z 2025-05-07T20:32:17.9930542Z if scale_ub is not None: 2025-05-07T20:32:17.9930645Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9930783Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9930857Z ) 2025-05-07T20:32:17.9930938Z else: 2025-05-07T20:32:17.9931029Z scale_ub_tensor = None 2025-05-07T20:32:17.9931103Z 2025-05-07T20:32:17.9931232Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9931325Z op = silu_mul_quant 2025-05-07T20:32:17.9931410Z if compiled: 2025-05-07T20:32:17.9931511Z op = torch.compile(op) 2025-05-07T20:32:17.9931614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9931686Z 2025-05-07T20:32:17.9931780Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9931890Z 2025-05-07T20:32:17.9931989Z moe/activation_test.py:117: 2025-05-07T20:32:17.9932114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9932216Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9932314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9932812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9932905Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9933265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9933534Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9933873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9933973Z kernel = self.compile( 2025-05-07T20:32:17.9934363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9934540Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9934665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9934670Z 2025-05-07T20:32:17.9934870Z self = 2025-05-07T20:32:17.9935642Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9936153Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e94f1430>} 2025-05-07T20:32:17.9936891Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9937092Z context = 2025-05-07T20:32:17.9937097Z 2025-05-07T20:32:17.9937259Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9937522Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9937627Z module_map=module_map) 2025-05-07T20:32:17.9937787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9937929Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9938006Z E ^ 2025-05-07T20:32:17.9938356Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9938361Z 2025-05-07T20:32:17.9938778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9938785Z 2025-05-07T20:32:17.9938887Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9939109Z self=, 2025-05-07T20:32:17.9939185Z T=2048, 2025-05-07T20:32:17.9939259Z D=7168, 2025-05-07T20:32:17.9939343Z scale_ub=1200.0, 2025-05-07T20:32:17.9939427Z contiguous=True, 2025-05-07T20:32:17.9939508Z compiled=False, 2025-05-07T20:32:17.9939584Z ) 2025-05-07T20:32:17.9939799Z self = 2025-05-07T20:32:17.9939990Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9939996Z 2025-05-07T20:32:17.9940381Z @given( 2025-05-07T20:32:17.9940544Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9940649Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9940913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9941032Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9941197Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9941270Z ) 2025-05-07T20:32:17.9941514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9941611Z def test_silu_mul_quant( 2025-05-07T20:32:17.9941687Z self, 2025-05-07T20:32:17.9941763Z T: int, 2025-05-07T20:32:17.9941837Z D: int, 2025-05-07T20:32:17.9941933Z scale_ub: Optional[float], 2025-05-07T20:32:17.9942022Z contiguous: bool, 2025-05-07T20:32:17.9942179Z compiled: bool, 2025-05-07T20:32:17.9942256Z ) -> None: 2025-05-07T20:32:17.9942350Z torch.manual_seed(2025) 2025-05-07T20:32:17.9942420Z 2025-05-07T20:32:17.9942587Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9944389Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9944398Z 2025-05-07T20:32:17.9944517Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9944522Z 2025-05-07T20:32:17.9944628Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9944848Z self=, 2025-05-07T20:32:17.9944930Z T=1, 2025-05-07T20:32:17.9945006Z D=5120, 2025-05-07T20:32:17.9945089Z scale_ub=1200.0, 2025-05-07T20:32:17.9945177Z contiguous=True, 2025-05-07T20:32:17.9945268Z compiled=False, 2025-05-07T20:32:17.9945341Z ) 2025-05-07T20:32:17.9945556Z self = 2025-05-07T20:32:17.9945719Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9945723Z 2025-05-07T20:32:17.9945797Z @given( 2025-05-07T20:32:17.9945916Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9946014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9946124Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9946244Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9946422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9946501Z ) 2025-05-07T20:32:17.9946744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9946835Z def test_silu_mul_quant( 2025-05-07T20:32:17.9946915Z self, 2025-05-07T20:32:17.9946995Z T: int, 2025-05-07T20:32:17.9947070Z D: int, 2025-05-07T20:32:17.9947170Z scale_ub: Optional[float], 2025-05-07T20:32:17.9947255Z contiguous: bool, 2025-05-07T20:32:17.9947338Z compiled: bool, 2025-05-07T20:32:17.9947416Z ) -> None: 2025-05-07T20:32:17.9947509Z torch.manual_seed(2025) 2025-05-07T20:32:17.9947579Z 2025-05-07T20:32:17.9947749Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9947821Z 2025-05-07T20:32:17.9947915Z x_sign = torch.sign(x) 2025-05-07T20:32:17.9948036Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.9948131Z x = x_sign * x_clamp 2025-05-07T20:32:17.9948213Z x0 = x[:, :D] 2025-05-07T20:32:17.9948291Z x1 = x[:, D:] 2025-05-07T20:32:17.9948360Z 2025-05-07T20:32:17.9948445Z if contiguous: 2025-05-07T20:32:17.9948535Z x0 = x0.contiguous() 2025-05-07T20:32:17.9948624Z x1 = x1.contiguous() 2025-05-07T20:32:17.9948781Z 2025-05-07T20:32:17.9948870Z if scale_ub is not None: 2025-05-07T20:32:17.9948973Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.9949112Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.9949187Z ) 2025-05-07T20:32:17.9949267Z else: 2025-05-07T20:32:17.9949358Z scale_ub_tensor = None 2025-05-07T20:32:17.9949428Z 2025-05-07T20:32:17.9949562Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.9949650Z op = silu_mul_quant 2025-05-07T20:32:17.9949733Z if compiled: 2025-05-07T20:32:17.9949880Z op = torch.compile(op) 2025-05-07T20:32:17.9949984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9950055Z 2025-05-07T20:32:17.9950153Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.9950157Z 2025-05-07T20:32:17.9950253Z moe/activation_test.py:117: 2025-05-07T20:32:17.9950384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9950487Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.9950588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.9951092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.9951188Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.9951543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.9951768Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.9952116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.9952212Z kernel = self.compile( 2025-05-07T20:32:17.9952588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.9952767Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.9952893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.9952898Z 2025-05-07T20:32:17.9953101Z self = 2025-05-07T20:32:17.9953873Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.9954412Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9412160>} 2025-05-07T20:32:17.9955167Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.9955362Z context = 2025-05-07T20:32:17.9955366Z 2025-05-07T20:32:17.9955529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.9955791Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.9955897Z module_map=module_map) 2025-05-07T20:32:17.9956054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.9956152Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.9956227Z E ^ 2025-05-07T20:32:17.9956584Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.9956591Z 2025-05-07T20:32:17.9957007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.9957011Z 2025-05-07T20:32:17.9957197Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9957421Z self=, 2025-05-07T20:32:17.9957496Z T=2048, 2025-05-07T20:32:17.9957570Z D=5120, 2025-05-07T20:32:17.9957656Z scale_ub=None, 2025-05-07T20:32:17.9957740Z contiguous=True, 2025-05-07T20:32:17.9957822Z compiled=False, 2025-05-07T20:32:17.9957898Z ) 2025-05-07T20:32:17.9958110Z self = 2025-05-07T20:32:17.9958285Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.9958289Z 2025-05-07T20:32:17.9958406Z @given( 2025-05-07T20:32:17.9958525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9958627Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9958739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9958855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9958976Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9959047Z ) 2025-05-07T20:32:17.9959289Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9959383Z def test_silu_mul_quant( 2025-05-07T20:32:17.9959457Z self, 2025-05-07T20:32:17.9959535Z T: int, 2025-05-07T20:32:17.9959609Z D: int, 2025-05-07T20:32:17.9959707Z scale_ub: Optional[float], 2025-05-07T20:32:17.9959799Z contiguous: bool, 2025-05-07T20:32:17.9959884Z compiled: bool, 2025-05-07T20:32:17.9959961Z ) -> None: 2025-05-07T20:32:17.9960057Z torch.manual_seed(2025) 2025-05-07T20:32:17.9960133Z 2025-05-07T20:32:17.9960298Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9960373Z 2025-05-07T20:32:17.9960463Z > x_sign = torch.sign(x) 2025-05-07T20:32:17.9962261Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9962270Z 2025-05-07T20:32:17.9962388Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:17.9962392Z 2025-05-07T20:32:17.9962499Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9962800Z self=, 2025-05-07T20:32:17.9962892Z T=16384, 2025-05-07T20:32:17.9962980Z D=5120, 2025-05-07T20:32:17.9963074Z scale_ub=None, 2025-05-07T20:32:17.9963156Z contiguous=True, 2025-05-07T20:32:17.9963245Z compiled=False, 2025-05-07T20:32:17.9963317Z ) 2025-05-07T20:32:17.9963531Z self = 2025-05-07T20:32:17.9963707Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.9963711Z 2025-05-07T20:32:17.9963785Z @given( 2025-05-07T20:32:17.9963899Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9963999Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9964110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9964226Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9964342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9964416Z ) 2025-05-07T20:32:17.9964662Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9964755Z def test_silu_mul_quant( 2025-05-07T20:32:17.9964829Z self, 2025-05-07T20:32:17.9964907Z T: int, 2025-05-07T20:32:17.9965065Z D: int, 2025-05-07T20:32:17.9965164Z scale_ub: Optional[float], 2025-05-07T20:32:17.9965253Z contiguous: bool, 2025-05-07T20:32:17.9965337Z compiled: bool, 2025-05-07T20:32:17.9965415Z ) -> None: 2025-05-07T20:32:17.9965507Z torch.manual_seed(2025) 2025-05-07T20:32:17.9965577Z 2025-05-07T20:32:17.9965747Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9967544Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
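Every OutOfMemoryError above carries the same remedy in its final sentences: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce caching-allocator fragmentation. A minimal sketch of applying that advice, assuming the variable takes effect before the first CUDA allocation (in CI it would normally be exported in the job environment rather than from Python):

    import os
    # Must be set before the first CUDA allocation; harmless if already set.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def report_cuda_memory(device: int = 0) -> None:
        # Prints the same quantities the OOM message reports.
        free, total = torch.cuda.mem_get_info(device)
        print(
            f"free={free / 2**20:.2f} MiB, total={total / 2**30:.2f} GiB, "
            f"allocated={torch.cuda.memory_allocated(device) / 2**30:.2f} GiB, "
            f"reserved={torch.cuda.memory_reserved(device) / 2**20:.2f} MiB"
        )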
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9967591Z 2025-05-07T20:32:17.9967714Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9967718Z 2025-05-07T20:32:17.9967818Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9968038Z self=, 2025-05-07T20:32:17.9968116Z T=4096, 2025-05-07T20:32:17.9968190Z D=5120, 2025-05-07T20:32:17.9968271Z scale_ub=None, 2025-05-07T20:32:17.9968356Z contiguous=True, 2025-05-07T20:32:17.9968439Z compiled=False, 2025-05-07T20:32:17.9968513Z ) 2025-05-07T20:32:17.9968730Z self = 2025-05-07T20:32:17.9968907Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.9968912Z 2025-05-07T20:32:17.9968990Z @given( 2025-05-07T20:32:17.9969103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9969201Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9969321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9969439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9969551Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9969625Z ) 2025-05-07T20:32:17.9969867Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9969961Z def test_silu_mul_quant( 2025-05-07T20:32:17.9970037Z self, 2025-05-07T20:32:17.9970113Z T: int, 2025-05-07T20:32:17.9970189Z D: int, 2025-05-07T20:32:17.9970285Z scale_ub: Optional[float], 2025-05-07T20:32:17.9970371Z contiguous: bool, 2025-05-07T20:32:17.9970509Z compiled: bool, 2025-05-07T20:32:17.9970586Z ) -> None: 2025-05-07T20:32:17.9970678Z torch.manual_seed(2025) 2025-05-07T20:32:17.9970757Z 2025-05-07T20:32:17.9970921Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9972699Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
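The "Tried to allocate" sizes for the large examples are exactly the footprint of x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16): T * 2D elements at 2 bytes each. (The 20.00 MiB requests that appear later for T=128 do not follow this formula, so they presumably reflect other allocations or allocator rounding.) A quick check of the arithmetic:

    def bf16_mib(T: int, D: int) -> float:
        # Size in MiB of a [T, 2*D] bfloat16 tensor (2 bytes per element).
        return T * 2 * D * 2 / 2**20

    assert bf16_mib(2048, 5120) == 40.0    # "40.00 MiB"
    assert bf16_mib(4096, 5120) == 80.0    # "80.00 MiB"
    assert bf16_mib(4096, 7168) == 112.0   # "112.00 MiB"
    assert bf16_mib(16384, 5120) == 320.0  # "320.00 MiB"
    assert bf16_mib(16384, 7168) == 448.0  # "448.00 MiB"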
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9972705Z 2025-05-07T20:32:17.9972820Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9972829Z 2025-05-07T20:32:17.9972931Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9973149Z self=, 2025-05-07T20:32:17.9973227Z T=2048, 2025-05-07T20:32:17.9973304Z D=5120, 2025-05-07T20:32:17.9973387Z scale_ub=None, 2025-05-07T20:32:17.9973555Z contiguous=False, 2025-05-07T20:32:17.9973642Z compiled=False, 2025-05-07T20:32:17.9973713Z ) 2025-05-07T20:32:17.9973927Z self = 2025-05-07T20:32:17.9974102Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.9974106Z 2025-05-07T20:32:17.9974181Z @given( 2025-05-07T20:32:17.9974294Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9974394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9974505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9974624Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9974777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9974849Z ) 2025-05-07T20:32:17.9975093Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9975184Z def test_silu_mul_quant( 2025-05-07T20:32:17.9975261Z self, 2025-05-07T20:32:17.9975340Z T: int, 2025-05-07T20:32:17.9975415Z D: int, 2025-05-07T20:32:17.9975511Z scale_ub: Optional[float], 2025-05-07T20:32:17.9975600Z contiguous: bool, 2025-05-07T20:32:17.9975684Z compiled: bool, 2025-05-07T20:32:17.9975771Z ) -> None: 2025-05-07T20:32:17.9975864Z torch.manual_seed(2025) 2025-05-07T20:32:17.9975934Z 2025-05-07T20:32:17.9976101Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9977881Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9977891Z 2025-05-07T20:32:17.9978009Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9978014Z 2025-05-07T20:32:17.9978113Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9978332Z self=, 2025-05-07T20:32:17.9978410Z T=4096, 2025-05-07T20:32:17.9978485Z D=7168, 2025-05-07T20:32:17.9978565Z scale_ub=None, 2025-05-07T20:32:17.9978653Z contiguous=True, 2025-05-07T20:32:17.9978734Z compiled=True, 2025-05-07T20:32:17.9978804Z ) 2025-05-07T20:32:17.9979066Z self = 2025-05-07T20:32:17.9979235Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.9979239Z 2025-05-07T20:32:17.9979318Z @given( 2025-05-07T20:32:17.9979432Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9979534Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9979651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9979766Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9979876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9979953Z ) 2025-05-07T20:32:17.9980396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9980492Z def test_silu_mul_quant( 2025-05-07T20:32:17.9980566Z self, 2025-05-07T20:32:17.9980641Z T: int, 2025-05-07T20:32:17.9980718Z D: int, 2025-05-07T20:32:17.9980821Z scale_ub: Optional[float], 2025-05-07T20:32:17.9980908Z contiguous: bool, 2025-05-07T20:32:17.9981002Z compiled: bool, 2025-05-07T20:32:17.9981119Z ) -> None: 2025-05-07T20:32:17.9981215Z torch.manual_seed(2025) 2025-05-07T20:32:17.9981289Z 2025-05-07T20:32:17.9981509Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9983301Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
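The (T, D, scale_ub, contiguous, compiled) combinations in these "Trying example" blocks come straight from the @given strategies in the listing: each argument is drawn with st.sampled_from, so the search space is the 5 x 2 x 2 x 2 x 2 cross product, from which Hypothesis draws up to _MAX_SAMPLES examples (the session header later in the log shows the 'ci' profile runs with derandomize=True, which is why the same combinations recur across retries). A self-contained toy version of the same pattern:

    from hypothesis import given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048]),
        contiguous=st.sampled_from([True, False]),
    )
    @settings(max_examples=6, deadline=None)
    def test_toy(T: int, contiguous: bool) -> None:
        # Hypothesis calls this once per drawn (T, contiguous) combination.
        assert T >= 1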
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9983345Z 2025-05-07T20:32:17.9983466Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9983471Z 2025-05-07T20:32:17.9983574Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9983792Z self=, 2025-05-07T20:32:17.9983867Z T=2048, 2025-05-07T20:32:17.9983946Z D=5120, 2025-05-07T20:32:17.9984030Z scale_ub=1200.0, 2025-05-07T20:32:17.9984114Z contiguous=False, 2025-05-07T20:32:17.9984199Z compiled=False, 2025-05-07T20:32:17.9984269Z ) 2025-05-07T20:32:17.9984483Z self = 2025-05-07T20:32:17.9984657Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.9984662Z 2025-05-07T20:32:17.9984736Z @given( 2025-05-07T20:32:17.9984850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9984949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9985063Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9985183Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9985292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9985364Z ) 2025-05-07T20:32:17.9985610Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9985707Z def test_silu_mul_quant( 2025-05-07T20:32:17.9985781Z self, 2025-05-07T20:32:17.9985859Z T: int, 2025-05-07T20:32:17.9985934Z D: int, 2025-05-07T20:32:17.9986028Z scale_ub: Optional[float], 2025-05-07T20:32:17.9986116Z contiguous: bool, 2025-05-07T20:32:17.9986200Z compiled: bool, 2025-05-07T20:32:17.9986279Z ) -> None: 2025-05-07T20:32:17.9986371Z torch.manual_seed(2025) 2025-05-07T20:32:17.9986444Z 2025-05-07T20:32:17.9986611Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9988395Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9988406Z 2025-05-07T20:32:17.9988527Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9988531Z 2025-05-07T20:32:17.9988633Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9988852Z self=, 2025-05-07T20:32:17.9988929Z T=4096, 2025-05-07T20:32:17.9989003Z D=7168, 2025-05-07T20:32:17.9989085Z scale_ub=1200.0, 2025-05-07T20:32:17.9989170Z contiguous=True, 2025-05-07T20:32:17.9989260Z compiled=False, 2025-05-07T20:32:17.9989332Z ) 2025-05-07T20:32:17.9989547Z self = 2025-05-07T20:32:17.9989716Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.9989720Z 2025-05-07T20:32:17.9989839Z @given( 2025-05-07T20:32:17.9989993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9990090Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9990210Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9994040Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9994172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9994247Z ) 2025-05-07T20:32:17.9994497Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9994588Z def test_silu_mul_quant( 2025-05-07T20:32:17.9994663Z self, 2025-05-07T20:32:17.9994833Z T: int, 2025-05-07T20:32:17.9994912Z D: int, 2025-05-07T20:32:17.9995008Z scale_ub: Optional[float], 2025-05-07T20:32:17.9995096Z contiguous: bool, 2025-05-07T20:32:17.9995182Z compiled: bool, 2025-05-07T20:32:17.9995261Z ) -> None: 2025-05-07T20:32:17.9995359Z torch.manual_seed(2025) 2025-05-07T20:32:17.9995435Z 2025-05-07T20:32:17.9995607Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.9997373Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.9997381Z 2025-05-07T20:32:17.9997502Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.9997507Z 2025-05-07T20:32:17.9997608Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.9997828Z self=, 2025-05-07T20:32:17.9997915Z T=16384, 2025-05-07T20:32:17.9997991Z D=7168, 2025-05-07T20:32:17.9998073Z scale_ub=None, 2025-05-07T20:32:17.9998162Z contiguous=False, 2025-05-07T20:32:17.9998245Z compiled=True, 2025-05-07T20:32:17.9998319Z ) 2025-05-07T20:32:17.9998538Z self = 2025-05-07T20:32:17.9998714Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.9998718Z 2025-05-07T20:32:17.9998796Z @given( 2025-05-07T20:32:17.9998911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.9999054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.9999177Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.9999289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.9999398Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.9999474Z ) 2025-05-07T20:32:17.9999719Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.9999812Z def test_silu_mul_quant( 2025-05-07T20:32:17.9999887Z self, 2025-05-07T20:32:17.9999961Z T: int, 2025-05-07T20:32:18.0000038Z D: int, 2025-05-07T20:32:18.0000135Z scale_ub: Optional[float], 2025-05-07T20:32:18.0000220Z contiguous: bool, 2025-05-07T20:32:18.0000310Z compiled: bool, 2025-05-07T20:32:18.0000385Z ) -> None: 2025-05-07T20:32:18.0000477Z torch.manual_seed(2025) 2025-05-07T20:32:18.0000551Z 2025-05-07T20:32:18.0000716Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0002561Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
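Note that once the device is nearly exhausted (26.44 MiB free of 22.07 GiB), every later example fails at its very first allocation, so most of the OOMs above are cascading symptoms of one filled-up GPU rather than independent failures. One speculative mitigation, sketched here only for illustration and not something activation_test.py currently does, is to release cached blocks between examples:

    import gc

    import torch

    def free_cuda_between_examples() -> None:
        gc.collect()               # drop dead Python references first
        torch.cuda.synchronize()   # let pending kernels finish
        torch.cuda.empty_cache()   # return cached blocks to the driver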
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0002602Z 2025-05-07T20:32:18.0002725Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.0002729Z 2025-05-07T20:32:18.0002834Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0003053Z self=, 2025-05-07T20:32:18.0003128Z T=4096, 2025-05-07T20:32:18.0003208Z D=7168, 2025-05-07T20:32:18.0003329Z scale_ub=None, 2025-05-07T20:32:18.0003415Z contiguous=True, 2025-05-07T20:32:18.0003501Z compiled=False, 2025-05-07T20:32:18.0003572Z ) 2025-05-07T20:32:18.0003786Z self = 2025-05-07T20:32:18.0003961Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.0003971Z 2025-05-07T20:32:18.0004045Z @given( 2025-05-07T20:32:18.0004159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0004263Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0004376Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0004493Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0004604Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0004677Z ) 2025-05-07T20:32:18.0004922Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0005014Z def test_silu_mul_quant( 2025-05-07T20:32:18.0005093Z self, 2025-05-07T20:32:18.0005172Z T: int, 2025-05-07T20:32:18.0005246Z D: int, 2025-05-07T20:32:18.0005340Z scale_ub: Optional[float], 2025-05-07T20:32:18.0005432Z contiguous: bool, 2025-05-07T20:32:18.0005515Z compiled: bool, 2025-05-07T20:32:18.0005596Z ) -> None: 2025-05-07T20:32:18.0005691Z torch.manual_seed(2025) 2025-05-07T20:32:18.0005762Z 2025-05-07T20:32:18.0005929Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0007719Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0007730Z 2025-05-07T20:32:18.0007849Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.0007854Z 2025-05-07T20:32:18.0007954Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0008177Z self=, 2025-05-07T20:32:18.0008256Z T=16384, 2025-05-07T20:32:18.0008329Z D=7168, 2025-05-07T20:32:18.0008407Z scale_ub=None, 2025-05-07T20:32:18.0008495Z contiguous=True, 2025-05-07T20:32:18.0008580Z compiled=False, 2025-05-07T20:32:18.0008652Z ) 2025-05-07T20:32:18.0008867Z self = 2025-05-07T20:32:18.0009038Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.0009042Z 2025-05-07T20:32:18.0009121Z @given( 2025-05-07T20:32:18.0009242Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0009342Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0009457Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0009575Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0009684Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0009844Z ) 2025-05-07T20:32:18.0010089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0010184Z def test_silu_mul_quant( 2025-05-07T20:32:18.0010258Z self, 2025-05-07T20:32:18.0010332Z T: int, 2025-05-07T20:32:18.0010411Z D: int, 2025-05-07T20:32:18.0010509Z scale_ub: Optional[float], 2025-05-07T20:32:18.0010596Z contiguous: bool, 2025-05-07T20:32:18.0010681Z compiled: bool, 2025-05-07T20:32:18.0010758Z ) -> None: 2025-05-07T20:32:18.0010854Z torch.manual_seed(2025) 2025-05-07T20:32:18.0010927Z 2025-05-07T20:32:18.0011136Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0012889Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0012899Z 2025-05-07T20:32:18.0013015Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.0013019Z 2025-05-07T20:32:18.0013124Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0013340Z self=, 2025-05-07T20:32:18.0013418Z T=16384, 2025-05-07T20:32:18.0013498Z D=7168, 2025-05-07T20:32:18.0013578Z scale_ub=1200.0, 2025-05-07T20:32:18.0013661Z contiguous=True, 2025-05-07T20:32:18.0013749Z compiled=False, 2025-05-07T20:32:18.0013822Z ) 2025-05-07T20:32:18.0014035Z self = 2025-05-07T20:32:18.0014220Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.0014225Z 2025-05-07T20:32:18.0014303Z @given( 2025-05-07T20:32:18.0014419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0014518Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0014628Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0014746Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0014856Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0014928Z ) 2025-05-07T20:32:18.0015218Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0015318Z def test_silu_mul_quant( 2025-05-07T20:32:18.0015391Z self, 2025-05-07T20:32:18.0015473Z T: int, 2025-05-07T20:32:18.0015547Z D: int, 2025-05-07T20:32:18.0015642Z scale_ub: Optional[float], 2025-05-07T20:32:18.0015732Z contiguous: bool, 2025-05-07T20:32:18.0015822Z compiled: bool, 2025-05-07T20:32:18.0015901Z ) -> None: 2025-05-07T20:32:18.0015998Z torch.manual_seed(2025) 2025-05-07T20:32:18.0016069Z 2025-05-07T20:32:18.0016234Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0017980Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0017987Z 2025-05-07T20:32:18.0018104Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.0018150Z 2025-05-07T20:32:18.0018290Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0018509Z self=, 2025-05-07T20:32:18.0018587Z T=128, 2025-05-07T20:32:18.0018663Z D=5120, 2025-05-07T20:32:18.0018744Z scale_ub=1200.0, 2025-05-07T20:32:18.0018832Z contiguous=False, 2025-05-07T20:32:18.0018912Z compiled=False, 2025-05-07T20:32:18.0018981Z ) 2025-05-07T20:32:18.0019200Z self = 2025-05-07T20:32:18.0019372Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:18.0019416Z 2025-05-07T20:32:18.0019501Z @given( 2025-05-07T20:32:18.0019615Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0019711Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0019824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0019937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0020056Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0020131Z ) 2025-05-07T20:32:18.0020374Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0020469Z def test_silu_mul_quant( 2025-05-07T20:32:18.0020542Z self, 2025-05-07T20:32:18.0020615Z T: int, 2025-05-07T20:32:18.0020693Z D: int, 2025-05-07T20:32:18.0020787Z scale_ub: Optional[float], 2025-05-07T20:32:18.0020874Z contiguous: bool, 2025-05-07T20:32:18.0020960Z compiled: bool, 2025-05-07T20:32:18.0021036Z ) -> None: 2025-05-07T20:32:18.0021196Z torch.manual_seed(2025) 2025-05-07T20:32:18.0021276Z 2025-05-07T20:32:18.0021440Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0021512Z 2025-05-07T20:32:18.0021604Z x_sign = torch.sign(x) 2025-05-07T20:32:18.0021727Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.0021823Z x = x_sign * x_clamp 2025-05-07T20:32:18.0021903Z x0 = x[:, :D] 2025-05-07T20:32:18.0021980Z x1 = x[:, D:] 2025-05-07T20:32:18.0022052Z 2025-05-07T20:32:18.0022133Z if contiguous: 2025-05-07T20:32:18.0022221Z x0 = x0.contiguous() 2025-05-07T20:32:18.0022310Z x1 = x1.contiguous() 2025-05-07T20:32:18.0022382Z 2025-05-07T20:32:18.0022470Z if scale_ub is not None: 2025-05-07T20:32:18.0022578Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.0022714Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.0022790Z ) 2025-05-07T20:32:18.0022941Z else: 2025-05-07T20:32:18.0023053Z scale_ub_tensor = None 2025-05-07T20:32:18.0023131Z 2025-05-07T20:32:18.0023261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.0023350Z op = silu_mul_quant 2025-05-07T20:32:18.0023439Z if compiled: 2025-05-07T20:32:18.0023544Z op = torch.compile(op) 2025-05-07T20:32:18.0023646Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.0023722Z 2025-05-07T20:32:18.0023811Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.0023816Z 2025-05-07T20:32:18.0023911Z moe/activation_test.py:117: 2025-05-07T20:32:18.0024040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.0024141Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.0024238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.0024750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.0024847Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.0025211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.0025436Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.0025883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.0025981Z kernel = self.compile( 2025-05-07T20:32:18.0026366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.0026546Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.0026671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.0026676Z 2025-05-07T20:32:18.0026878Z self = 2025-05-07T20:32:18.0027703Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.0028204Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e9301ca0>} 2025-05-07T20:32:18.0028957Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.0029147Z context = 2025-05-07T20:32:18.0029151Z 2025-05-07T20:32:18.0029318Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.0029583Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.0029691Z module_map=module_map) 2025-05-07T20:32:18.0029857Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.0029954Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.0030028Z E ^ 2025-05-07T20:32:18.0030394Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.0030399Z 2025-05-07T20:32:18.0030820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.0030825Z 2025-05-07T20:32:18.0030928Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0031150Z self=, 2025-05-07T20:32:18.0031224Z T=2048, 2025-05-07T20:32:18.0031298Z D=7168, 2025-05-07T20:32:18.0031379Z scale_ub=None, 2025-05-07T20:32:18.0031510Z contiguous=False, 2025-05-07T20:32:18.0031596Z compiled=False, 2025-05-07T20:32:18.0031670Z ) 2025-05-07T20:32:18.0031885Z self = 2025-05-07T20:32:18.0032056Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:18.0032068Z 2025-05-07T20:32:18.0032147Z @given( 2025-05-07T20:32:18.0032263Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0032363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0032474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0032589Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0032703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0032775Z ) 2025-05-07T20:32:18.0033018Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0033115Z def test_silu_mul_quant( 2025-05-07T20:32:18.0033189Z self, 2025-05-07T20:32:18.0033269Z T: int, 2025-05-07T20:32:18.0033346Z D: int, 2025-05-07T20:32:18.0033441Z scale_ub: Optional[float], 2025-05-07T20:32:18.0033532Z contiguous: bool, 2025-05-07T20:32:18.0033615Z compiled: bool, 2025-05-07T20:32:18.0033691Z ) -> None: 2025-05-07T20:32:18.0033882Z torch.manual_seed(2025) 2025-05-07T20:32:18.0033960Z 2025-05-07T20:32:18.0034126Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0035881Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
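The recurring CompilationError is an architecture limitation rather than a kernel-launch bug: this runner's GPU reports 22.07 GiB total, consistent with an NVIDIA A10G (compute capability 8.6), and the Triton build here only exposes fp8e4nv on newer parts, which is why the error lists ('fp8e4b15', 'fp8e5') as the only supported fp8 dtypes. A hedged sketch of a capability guard a test could use to skip these paths; supports_fp8e4nv is a hypothetical helper, not an FBGEMM or Triton API, and the (8, 9) cutoff is an assumption based on fp8e4nv being available on Ada (sm_89) and Hopper (sm_90):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (e4m3) needs compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Decorator for test methods that exercise fp8e4nv kernels.
    requires_fp8e4nv = unittest.skipUnless(
        supports_fp8e4nv(), "fp8e4nv not supported on this GPU"
    )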
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0035924Z 2025-05-07T20:32:18.0036043Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.0036047Z 2025-05-07T20:32:18.0036150Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0036370Z self=, 2025-05-07T20:32:18.0036448Z T=128, 2025-05-07T20:32:18.0036528Z D=7168, 2025-05-07T20:32:18.0036608Z scale_ub=1200.0, 2025-05-07T20:32:18.0036697Z contiguous=True, 2025-05-07T20:32:18.0036782Z compiled=True, 2025-05-07T20:32:18.0036859Z ) 2025-05-07T20:32:18.0037078Z self = 2025-05-07T20:32:18.0037244Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:18.0037248Z 2025-05-07T20:32:18.0037321Z @given( 2025-05-07T20:32:18.0037441Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0037543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0037654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0037773Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0037884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0037957Z ) 2025-05-07T20:32:18.0038204Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0038296Z def test_silu_mul_quant( 2025-05-07T20:32:18.0038373Z self, 2025-05-07T20:32:18.0038447Z T: int, 2025-05-07T20:32:18.0038520Z D: int, 2025-05-07T20:32:18.0038619Z scale_ub: Optional[float], 2025-05-07T20:32:18.0038705Z contiguous: bool, 2025-05-07T20:32:18.0038787Z compiled: bool, 2025-05-07T20:32:18.0038867Z ) -> None: 2025-05-07T20:32:18.0038959Z torch.manual_seed(2025) 2025-05-07T20:32:18.0039030Z 2025-05-07T20:32:18.0039236Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0039315Z 2025-05-07T20:32:18.0039407Z x_sign = torch.sign(x) 2025-05-07T20:32:18.0039529Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.0039615Z x = x_sign * x_clamp 2025-05-07T20:32:18.0039697Z x0 = x[:, :D] 2025-05-07T20:32:18.0039777Z x1 = x[:, D:] 2025-05-07T20:32:18.0039850Z 2025-05-07T20:32:18.0039935Z if contiguous: 2025-05-07T20:32:18.0040023Z x0 = x0.contiguous() 2025-05-07T20:32:18.0040464Z x1 = x1.contiguous() 2025-05-07T20:32:18.0040574Z 2025-05-07T20:32:18.0040700Z if scale_ub is not None: 2025-05-07T20:32:18.0040847Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.0041068Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.0041215Z ) 2025-05-07T20:32:18.0041309Z else: 2025-05-07T20:32:18.0041405Z scale_ub_tensor = None 2025-05-07T20:32:18.0041478Z 2025-05-07T20:32:18.0041620Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.0041707Z op = silu_mul_quant 2025-05-07T20:32:18.0041793Z if compiled: 2025-05-07T20:32:18.0041893Z op = torch.compile(op) 2025-05-07T20:32:18.0041999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.0042221Z 2025-05-07T20:32:18.0042319Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.0042324Z 2025-05-07T20:32:18.0042421Z moe/activation_test.py:117: 2025-05-07T20:32:18.0042550Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.0042655Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.0042752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.0043184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:18.0043275Z return fn(*args, **kwargs) 2025-05-07T20:32:18.0043767Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.0043927Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.0044283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.0044513Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.0044861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.0044954Z kernel = self.compile( 2025-05-07T20:32:18.0045334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.0045507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.0045631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.0045636Z 2025-05-07T20:32:18.0045850Z self = 2025-05-07T20:32:18.0046619Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.0047134Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd7e95ae280>} 2025-05-07T20:32:18.0047881Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.0048080Z context = 2025-05-07T20:32:18.0048084Z 2025-05-07T20:32:18.0048250Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.0048578Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.0048691Z module_map=module_map) 2025-05-07T20:32:18.0048855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.0048952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.0049036Z E ^ 2025-05-07T20:32:18.0049391Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.0049396Z 2025-05-07T20:32:18.0049805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.0049809Z 2025-05-07T20:32:18.0049908Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0050127Z self=, 2025-05-07T20:32:18.0050205Z T=128, 2025-05-07T20:32:18.0050280Z D=7168, 2025-05-07T20:32:18.0050368Z scale_ub=1200.0, 2025-05-07T20:32:18.0050454Z contiguous=True, 2025-05-07T20:32:18.0050535Z compiled=False, 2025-05-07T20:32:18.0050607Z ) 2025-05-07T20:32:18.0050827Z self = 2025-05-07T20:32:18.0051033Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.0051075Z 2025-05-07T20:32:18.0051154Z @given( 2025-05-07T20:32:18.0051270Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0051369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0051490Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0051603Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0051714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0051788Z ) 2025-05-07T20:32:18.0052032Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0052135Z def test_silu_mul_quant( 2025-05-07T20:32:18.0052251Z self, 2025-05-07T20:32:18.0052326Z T: int, 2025-05-07T20:32:18.0052405Z D: int, 2025-05-07T20:32:18.0052500Z scale_ub: Optional[float], 2025-05-07T20:32:18.0052590Z contiguous: bool, 2025-05-07T20:32:18.0052678Z compiled: bool, 2025-05-07T20:32:18.0052764Z ) -> None: 2025-05-07T20:32:18.0052867Z torch.manual_seed(2025) 2025-05-07T20:32:18.0052957Z 2025-05-07T20:32:18.0053149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0053220Z 2025-05-07T20:32:18.0053311Z x_sign = torch.sign(x) 2025-05-07T20:32:18.0053433Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.0055220Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
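For orientation amid the repetition: the test compares silu_mul_quant against a reference that computes y = x0 * sigmoid(x0) * x1 in fp32 and then row-quantizes it with triton_quantize_fp8_row, dequantizing as y_fp8.to(torch.float32) * y_scale[:, None]. An eager-PyTorch sketch of that row-wise quantization step, assuming a conventional per-row scale with e4m3 max value 448; this is illustrative only, not the FBGEMM implementation:

    import torch

    FP8_MAX = 448.0  # assumed max representable magnitude of float8_e4m3fn

    def quantize_fp8_row_eager(y: torch.Tensor, scale_ub=None):
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the row maxima
        y_scale = row_max / FP8_MAX                     # one scale per row
        y_fp8 = (y / y_scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
        return y_fp8.to(torch.float8_e4m3fn), y_scale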
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0055233Z 2025-05-07T20:32:18.0055353Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:18.0055358Z 2025-05-07T20:32:18.0055462Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0055682Z self=, 2025-05-07T20:32:18.0055757Z T=128, 2025-05-07T20:32:18.0055833Z D=5120, 2025-05-07T20:32:18.0055921Z scale_ub=1200.0, 2025-05-07T20:32:18.0055999Z contiguous=True, 2025-05-07T20:32:18.0056082Z compiled=True, 2025-05-07T20:32:18.0056155Z ) 2025-05-07T20:32:18.0056366Z self = 2025-05-07T20:32:18.0056580Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:18.0056585Z 2025-05-07T20:32:18.0056660Z @given( 2025-05-07T20:32:18.0056775Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0056875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0056992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0057109Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0057223Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0057294Z ) 2025-05-07T20:32:18.0057541Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0057633Z def test_silu_mul_quant( 2025-05-07T20:32:18.0057707Z self, 2025-05-07T20:32:18.0057783Z T: int, 2025-05-07T20:32:18.0057856Z D: int, 2025-05-07T20:32:18.0057952Z scale_ub: Optional[float], 2025-05-07T20:32:18.0058045Z contiguous: bool, 2025-05-07T20:32:18.0058134Z compiled: bool, 2025-05-07T20:32:18.0058211Z ) -> None: 2025-05-07T20:32:18.0058306Z torch.manual_seed(2025) 2025-05-07T20:32:18.0058378Z 2025-05-07T20:32:18.0058540Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0058614Z 2025-05-07T20:32:18.0058771Z x_sign = torch.sign(x) 2025-05-07T20:32:18.0058932Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.0060710Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0060764Z 2025-05-07T20:32:18.0060886Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:18.0060890Z 2025-05-07T20:32:18.0060991Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.0061284Z self=, 2025-05-07T20:32:18.0061368Z T=128, 2025-05-07T20:32:18.0061443Z D=7168, 2025-05-07T20:32:18.0061522Z scale_ub=None, 2025-05-07T20:32:18.0061607Z contiguous=True, 2025-05-07T20:32:18.0061688Z compiled=True, 2025-05-07T20:32:18.0061758Z ) 2025-05-07T20:32:18.0061972Z self = 2025-05-07T20:32:18.0062134Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:18.0062138Z 2025-05-07T20:32:18.0062214Z @given( 2025-05-07T20:32:18.0062330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.0062426Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.0062553Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.0062668Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.0062780Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.0062854Z ) 2025-05-07T20:32:18.0063100Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.0063194Z def test_silu_mul_quant( 2025-05-07T20:32:18.0063276Z self, 2025-05-07T20:32:18.0063351Z T: int, 2025-05-07T20:32:18.0063429Z D: int, 2025-05-07T20:32:18.0063524Z scale_ub: Optional[float], 2025-05-07T20:32:18.0063609Z contiguous: bool, 2025-05-07T20:32:18.0063696Z compiled: bool, 2025-05-07T20:32:18.0063772Z ) -> None: 2025-05-07T20:32:18.0063865Z torch.manual_seed(2025) 2025-05-07T20:32:18.0063939Z 2025-05-07T20:32:18.0064104Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.0065891Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.0065903Z 2025-05-07T20:32:18.0066020Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.0066152Z =============================== warnings summary =============================== 2025-05-07T20:32:18.0066461Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:18.0066762Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:18.0067058Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:18.0067960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:18.0068230Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:18.0068239Z 2025-05-07T20:32:18.0068448Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:18.0068615Z ================= 1 failed, 1 deselected, 3 warnings in 24.05s ================= 2025-05-07T20:32:19.6289721Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:19.6931904Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:19.6932164Z 2025-05-07T20:32:21.6948931Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:23.8532853Z ============================= test session starts ============================== 2025-05-07T20:32:23.8533485Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:23.8534024Z cachedir: .pytest_cache 2025-05-07T20:32:23.8534607Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:23.8535385Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:23.8535801Z plugins: hypothesis-6.131.14 2025-05-07T20:32:25.4755442Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:25.6872946Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:25.6873372Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:25.6873593Z 2025-05-07T20:32:28.4026568Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.4027301Z self=, 2025-05-07T20:32:28.4027732Z T=1, 2025-05-07T20:32:28.4027935Z D=5120, 2025-05-07T20:32:28.4028138Z scale_ub=None, 2025-05-07T20:32:28.4028372Z contiguous=True, 2025-05-07T20:32:28.4028611Z compiled=True, 2025-05-07T20:32:28.4028830Z ) 2025-05-07T20:32:28.4029168Z self = 2025-05-07T20:32:28.4029672Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:28.4029942Z 2025-05-07T20:32:28.4030041Z @given( 2025-05-07T20:32:28.4030591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.4030928Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.4031246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.4031582Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.4031933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.4032232Z ) 2025-05-07T20:32:28.4032585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.4033041Z def test_silu_mul_quant( 2025-05-07T20:32:28.4033293Z self, 2025-05-07T20:32:28.4033502Z T: int, 2025-05-07T20:32:28.4033705Z D: int, 2025-05-07T20:32:28.4033933Z scale_ub: Optional[float], 2025-05-07T20:32:28.4034218Z contiguous: bool, 2025-05-07T20:32:28.4034464Z compiled: bool, 2025-05-07T20:32:28.4034702Z ) -> None: 2025-05-07T20:32:28.4034933Z torch.manual_seed(2025) 2025-05-07T20:32:28.4035188Z 2025-05-07T20:32:28.4035472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.4035829Z 2025-05-07T20:32:28.4036029Z x_sign = torch.sign(x) 2025-05-07T20:32:28.4036332Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:28.4036808Z x = x_sign * x_clamp 2025-05-07T20:32:28.4037059Z x0 = x[:, :D] 2025-05-07T20:32:28.4037282Z x1 = x[:, D:] 2025-05-07T20:32:28.4037497Z 2025-05-07T20:32:28.4037688Z if contiguous: 2025-05-07T20:32:28.4037934Z x0 = x0.contiguous() 2025-05-07T20:32:28.4038201Z x1 = x1.contiguous() 2025-05-07T20:32:28.4038446Z 2025-05-07T20:32:28.4038649Z if scale_ub is not None: 2025-05-07T20:32:28.4038933Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.4039284Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.4039597Z ) 2025-05-07T20:32:28.4039902Z else: 2025-05-07T20:32:28.4040363Z scale_ub_tensor = None 2025-05-07T20:32:28.4040624Z 2025-05-07T20:32:28.4040871Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.4041203Z op = silu_mul_quant 2025-05-07T20:32:28.4041462Z if compiled: 2025-05-07T20:32:28.4041729Z op = torch.compile(op) 2025-05-07T20:32:28.4042038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.4042318Z 2025-05-07T20:32:28.4042522Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.4042820Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.4043116Z 2025-05-07T20:32:28.4043366Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.4043716Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.4044021Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.4044341Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.4044713Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.4045033Z 2025-05-07T20:32:28.4045238Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:28.4045442Z 2025-05-07T20:32:28.4045546Z moe/activation_test.py:126: 2025-05-07T20:32:28.4045854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.4046245Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.4046592Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.4047394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.4048158Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.4048710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.4049401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.4050179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.4050923Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.4051692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:28.4052459Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.4053191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.4053828Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.4054445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.4054969Z fn() 2025-05-07T20:32:28.4055485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.4056067Z self.fn.run( 
2025-05-07T20:32:28.4056542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.4057082Z kernel = self.compile( 2025-05-07T20:32:28.4057695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.4058414Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.4058826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.4059063Z 2025-05-07T20:32:28.4059286Z self = 2025-05-07T20:32:28.4060370Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.4061931Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdcfd99d0>} 2025-05-07T20:32:28.4063282Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.4064308Z context = 2025-05-07T20:32:28.4064603Z 2025-05-07T20:32:28.4064780Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.4065307Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.4065779Z module_map=module_map) 2025-05-07T20:32:28.4073060Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.4073466Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.4073752Z E ^ 2025-05-07T20:32:28.4074242Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.4074694Z 2025-05-07T20:32:28.4075137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.4075664Z 2025-05-07T20:32:28.4075772Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.4076208Z self=, 2025-05-07T20:32:28.4076620Z T=2048, 2025-05-07T20:32:28.4076813Z D=5120, 2025-05-07T20:32:28.4077020Z scale_ub=1200.0, 2025-05-07T20:32:28.4077256Z contiguous=True, 2025-05-07T20:32:28.4077485Z compiled=False, 2025-05-07T20:32:28.4077711Z ) 2025-05-07T20:32:29.9271538Z self = 2025-05-07T20:32:29.9272567Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:29.9272869Z 2025-05-07T20:32:29.9272966Z @given( 2025-05-07T20:32:29.9273209Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.9273544Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.9273881Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.9274243Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.9274593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.9274896Z ) 2025-05-07T20:32:29.9275267Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.9275724Z def test_silu_mul_quant( 2025-05-07T20:32:29.9275986Z self, 2025-05-07T20:32:29.9276199Z T: int, 2025-05-07T20:32:29.9276406Z D: int, 2025-05-07T20:32:29.9276640Z scale_ub: Optional[float], 2025-05-07T20:32:29.9276929Z contiguous: bool, 2025-05-07T20:32:29.9277177Z compiled: bool, 2025-05-07T20:32:29.9277430Z ) -> None: 2025-05-07T20:32:29.9277667Z torch.manual_seed(2025) 2025-05-07T20:32:29.9277918Z 2025-05-07T20:32:29.9278209Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.9278573Z 
2025-05-07T20:32:29.9278773Z x_sign = torch.sign(x) 2025-05-07T20:32:29.9279233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.9279566Z x = x_sign * x_clamp 2025-05-07T20:32:29.9279833Z x0 = x[:, :D] 2025-05-07T20:32:29.9280059Z x1 = x[:, D:] 2025-05-07T20:32:29.9280290Z 2025-05-07T20:32:29.9280490Z if contiguous: 2025-05-07T20:32:29.9280734Z x0 = x0.contiguous() 2025-05-07T20:32:29.9281012Z x1 = x1.contiguous() 2025-05-07T20:32:29.9281269Z 2025-05-07T20:32:29.9281469Z if scale_ub is not None: 2025-05-07T20:32:29.9281758Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.9282113Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.9282516Z ) 2025-05-07T20:32:29.9282784Z else: 2025-05-07T20:32:29.9283099Z scale_ub_tensor = None 2025-05-07T20:32:29.9283457Z 2025-05-07T20:32:29.9283715Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.9284050Z op = silu_mul_quant 2025-05-07T20:32:29.9284311Z if compiled: 2025-05-07T20:32:29.9284581Z op = torch.compile(op) 2025-05-07T20:32:29.9284894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.9285175Z 2025-05-07T20:32:29.9285381Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.9285560Z 2025-05-07T20:32:29.9285667Z moe/activation_test.py:117: 2025-05-07T20:32:29.9285980Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.9286316Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.9286611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.9287323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.9288022Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.9288578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.9289290Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.9289974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.9290518Z kernel = self.compile( 2025-05-07T20:32:29.9291078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.9291748Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.9292153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.9292404Z 2025-05-07T20:32:29.9292687Z self = 2025-05-07T20:32:29.9293801Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.9295211Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
2025-05-07T20:32:29.9296580Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:29.9297619Z context = <...>
2025-05-07T20:32:29.9297924Z 
2025-05-07T20:32:29.9298100Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:29.9298653Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:29.9299136Z                            module_map=module_map)
2025-05-07T20:32:29.9299517Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:29.9299978Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:29.9300256Z E       ^
2025-05-07T20:32:29.9300723Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:29.9301350Z 
2025-05-07T20:32:29.9301782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
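Both failure modes abort at the same point: src.make_ir raises while lowering the kernel AST, because Triton's NVIDIA backend lowers the fp8e4nv (e4m3) type only on GPUs of compute capability 8.9 or newer; on older parts it exposes only 'fp8e5' and 'fp8e4b15', exactly as the ValueError reports. The failure does not depend on FBGEMM at all. A minimal sketch (hypothetical kernel name; standard Triton and PyTorch APIs only) that should reproduce the same CompilationError on a pre-SM89 GPU:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(X, Y, n, BLOCK: tl.constexpr):
        # Casting to fp8e4nv forces the unsupported dtype into the IR, so
        # compilation fails before the kernel ever launches.
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(X + offs, mask=mask)
        tl.store(Y + offs, x.to(tl.float8e4nv), mask=mask)

    n = 1024
    x = torch.randn(n, device="cuda")
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    # On an sm_86-class GPU this raises CompilationError:
    # "type fp8e4nv not supported in this architecture ..."
    _cast_to_fp8e4nv[(triton.cdiv(n, 1024),)](x, y, n, BLOCK=1024)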
2025-05-07T20:32:29.9302437Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:29.9302874Z     self=<...>,
2025-05-07T20:32:29.9303283Z     T=2048,
2025-05-07T20:32:29.9303537Z     D=5120,
2025-05-07T20:32:29.9303750Z     scale_ub=1200.0,
2025-05-07T20:32:29.9304058Z     contiguous=True,
2025-05-07T20:32:29.9304334Z     compiled=True,
2025-05-07T20:32:29.9304560Z )
2025-05-07T20:32:29.9304884Z self = <...>
2025-05-07T20:32:29.9305400Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:29.9305693Z 
[... test source identical to the example above, through the definition of fn(); elided ...]
2025-05-07T20:32:29.9317747Z         y_fp8, y_scale = fn()
2025-05-07T20:32:29.9318052Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:29.9318355Z 
2025-05-07T20:32:29.9318592Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:29.9318936Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:29.9319237Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:29.9319650Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:29.9320021Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:29.9320341Z 
2025-05-07T20:32:29.9320554Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:29.9320753Z 
2025-05-07T20:32:29.9320856Z moe/activation_test.py:126: 
2025-05-07T20:32:29.9321166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:29.9321511Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:29.9321842Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:29.9322651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:29.9323469Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:29.9324028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:29.9324718Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:29.9325413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:29.9326158Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:29.9326908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:29.9327659Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:29.9328397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:29.9329047Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:29.9329651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:29.9330188Z     fn()
2025-05-07T20:32:29.9330710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:29.9331304Z     self.fn.run(
2025-05-07T20:32:29.9331771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:29.9332310Z     kernel = self.compile(
2025-05-07T20:32:29.9332861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:29.9333512Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:29.9333966Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:29.9334216Z 
2025-05-07T20:32:29.9334431Z self = <...>
2025-05-07T20:32:29.9335523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:29.9336927Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fbfdbb7ca60>}
2025-05-07T20:32:29.9338305Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:29.9339342Z context = <...>
2025-05-07T20:32:29.9339640Z 
2025-05-07T20:32:29.9339817Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:29.9340626Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:29.9341149Z                            module_map=module_map)
2025-05-07T20:32:29.9341698Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:29.9342064Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:29.9342332Z E       ^
2025-05-07T20:32:29.9342801Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:29.9343250Z 
2025-05-07T20:32:29.9343703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
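Note the asymmetry between the two dumps above: with compiled=False the failure surfaces directly in fn() when _fbgemm_silu_mul_quant is compiled, while with compiled=True fn() gets through and the reference path fails instead, inside the autotuner of triton_quantize_fp8_row (do_bench compiles _kernel_quantize_fp8_row once per pruned config). Either way, every example dies on the first fp8e4nv kernel it reaches. For reference, here is a pure-PyTorch sketch of the rowwise fp8 quantization that ref_fn delegates to; this is an illustration under stated assumptions, not FBGEMM's actual triton_quantize_fp8_row, but it is consistent with the dequantization y_fp8.to(torch.float32) * y_scale[:, None] used by the test (requires a PyTorch build with torch.float8_e4m3fn):

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One dequantization scale per row: y ~= y_fp8.to(fp32) * scale[:, None].
        row_max = x.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            # Clamp the per-row maximum by the scale upper bound, as the
            # scale_ub_tensor argument does in the test.
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        row_max = torch.clamp(row_max, min=1e-12)  # guard all-zero rows
        scale = row_max / FP8_MAX
        xq = (x.to(torch.float32) / scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
        return xq.to(torch.float8_e4m3fn), scale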
[... the remaining Hypothesis examples fail identically; the repeated test source and tracebacks, matching the two shown above, are elided. compiled=False dies compiling _fbgemm_silu_mul_quant; compiled=True dies in the reference path compiling _kernel_quantize_fp8_row ...]
2025-05-07T20:32:29.9344335Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:32:31.2708752Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row
2025-05-07T20:32:31.2750243Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:32:32.9942186Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:32:32.9973246Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in _kernel_quantize_fp8_row
2025-05-07T20:32:33.0778872Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:32:33.4890430Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:32:33.4929591Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row
2025-05-07T20:32:34.1466151Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row
sanitize_overflow=True)
2025-05-07T20:32:34.7592597Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfda8f0f70>}
2025-05-07T20:32:34.7595028Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:34.7596805Z context =
2025-05-07T20:32:34.7597284Z
2025-05-07T20:32:34.7597556Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:34.7598407Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:34.7599319Z                            module_map=module_map)
2025-05-07T20:32:34.7599905Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.7600476Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:34.7600914Z E       ^
2025-05-07T20:32:34.7601685Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.7602451Z
2025-05-07T20:32:34.7603170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
[... the test body and this identical CompilationError traceback (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row) repeat verbatim for each of the following examples; only the sampled parameters change ...]
2025-05-07T20:32:34.7604227Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:35.7576448Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:36.5913346Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:36.6347964Z W0507 20:32:36.633162 87966 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:36.6349252Z W0507 20:32:36.633162 87966 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:36.6350623Z W0507 20:32:36.633162 87966 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:36.6351620Z W0507 20:32:36.633162 87966 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:36.6352760Z W0507 20:32:36.633162 87966 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
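The recompile warning above is expected with this parameter sweep: each sampled (T, contiguous) combination changes the shapes and strides of x0/x1, so torch.compile specializes silu_mul_quant per combination until dynamo's recompile_limit of 8 is hit, after which it stops recompiling (the last guard failure shown is the stride change from slicing without .contiguous()). If the recompiles themselves ever need debugging, a minimal sketch, assuming silu_mul_quant is importable from the activation module named in the warning:

    import torch
    import torch._dynamo
    # Assumed import path, taken from the file named in the warning above.
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Option 1: allow more recompiles before dynamo gives up (debugging only).
    torch._dynamo.config.recompile_limit = 64

    # Option 2: mark the token dimension dynamic so T=1/128/2048/... share one
    # graph (this addresses shape-driven recompiles; the stride mismatch from
    # non-contiguous slices would still force a separate graph).
    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)

    op = torch.compile(silu_mul_quant)
    y_fp8, y_scale = op(x0, x1, None)  # scale_ub_tensor=None, as in the test

Running with TORCH_LOGS="recompiles", as the warning itself suggests, prints every guard failure rather than only the last one.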
[... the T=16384 example replays the same test body and fails with the identical ref_fn traceback ...]
2025-05-07T20:32:36.7606264Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[... test body elided (identical to above); this example fails one step earlier, at y_fp8, y_scale = fn(), while compiling the forward kernel itself ...]
2025-05-07T20:32:36.9324249Z moe/activation_test.py:117:
2025-05-07T20:32:36.9324551Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.9324897Z moe/activation_test.py:115: in fn
2025-05-07T20:32:36.9325187Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:36.9325747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:36.9326312Z     return fn(*args, **kwargs)
2025-05-07T20:32:36.9326986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:36.9327674Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:36.9328217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:36.9328961Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:36.9329642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:36.9330171Z     kernel = self.compile(
2025-05-07T20:32:36.9330719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:36.9331389Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:36.9331790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.9337614Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:36.9338150Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:36.9338622Z                            module_map=module_map)
2025-05-07T20:32:36.9338993Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.9339435Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:36.9339706Z E       ^
2025-05-07T20:32:36.9340327Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.9340786Z
2025-05-07T20:32:36.9341264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
[... the remaining examples fail with the same two tracebacks; the failing call site is noted per example ...]
2025-05-07T20:32:36.9341887Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in ref_fn (_kernel_quantize_fp8_row)
2025-05-07T20:32:37.0212105Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> CompilationError in fn (_fbgemm_silu_mul_quant)
2025-05-07T20:32:37.3916932Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in fn (_fbgemm_silu_mul_quant)
2025-05-07T20:32:37.3949580Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in fn (_fbgemm_silu_mul_quant)
2025-05-07T20:32:37.5524779Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn (_fbgemm_silu_mul_quant)
2025-05-07T20:32:37.5556512Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in fn (_fbgemm_silu_mul_quant; trace truncated in this excerpt)
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.7879283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.7879967Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.7880640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.7881182Z kernel = self.compile( 2025-05-07T20:32:37.7881728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.7882382Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.7882774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.7883009Z 2025-05-07T20:32:37.7883263Z self = 2025-05-07T20:32:37.7884344Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.7885743Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9bf0ee0>} 2025-05-07T20:32:37.7887100Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.7888120Z context = 2025-05-07T20:32:37.7888415Z 2025-05-07T20:32:37.7888582Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.7889111Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.7889581Z module_map=module_map) 2025-05-07T20:32:37.7889963Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.7890327Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.7890663Z E ^ 2025-05-07T20:32:37.7891159Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.7891621Z 2025-05-07T20:32:37.7892038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.7892549Z 2025-05-07T20:32:37.7892661Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.7893072Z self=, 2025-05-07T20:32:37.7893479Z T=1, 2025-05-07T20:32:37.7893668Z D=7168, 2025-05-07T20:32:37.7893868Z scale_ub=1200.0, 2025-05-07T20:32:37.7894139Z contiguous=True, 2025-05-07T20:32:37.7894367Z compiled=True, 2025-05-07T20:32:37.7894578Z ) 2025-05-07T20:32:37.7894897Z self = 2025-05-07T20:32:37.7895394Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.7895659Z 2025-05-07T20:32:37.7895744Z @given( 2025-05-07T20:32:37.7895975Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.7896295Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.7896611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.7896939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.7897277Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.7897566Z ) 2025-05-07T20:32:37.7897927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.7898369Z def test_silu_mul_quant( 2025-05-07T20:32:37.7898615Z self, 2025-05-07T20:32:37.7898814Z T: int, 2025-05-07T20:32:37.7899007Z D: int, 2025-05-07T20:32:37.7899258Z scale_ub: Optional[float], 2025-05-07T20:32:37.7899556Z contiguous: bool, 2025-05-07T20:32:37.7899798Z compiled: bool, 2025-05-07T20:32:37.7900026Z ) -> None: 2025-05-07T20:32:37.7900254Z torch.manual_seed(2025) 2025-05-07T20:32:37.7900495Z 2025-05-07T20:32:37.7900770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.7901222Z 2025-05-07T20:32:37.7901409Z x_sign = torch.sign(x) 2025-05-07T20:32:37.7901703Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.7902018Z x = x_sign * x_clamp 2025-05-07T20:32:37.7902258Z x0 = x[:, :D] 2025-05-07T20:32:37.7902481Z x1 = x[:, D:] 2025-05-07T20:32:37.7902690Z 2025-05-07T20:32:37.7902883Z if contiguous: 2025-05-07T20:32:37.7903114Z x0 = x0.contiguous() 2025-05-07T20:32:37.7903424Z x1 = x1.contiguous() 2025-05-07T20:32:37.7903667Z 2025-05-07T20:32:37.7903855Z if scale_ub is not None: 2025-05-07T20:32:37.7904130Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.7904468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.7904777Z ) 2025-05-07T20:32:37.7904974Z else: 2025-05-07T20:32:37.7905188Z scale_ub_tensor = None 2025-05-07T20:32:37.7905437Z 2025-05-07T20:32:37.7905673Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.7905989Z op = silu_mul_quant 2025-05-07T20:32:37.7906241Z if compiled: 2025-05-07T20:32:37.7906491Z op = torch.compile(op) 2025-05-07T20:32:37.7906788Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.7907063Z 2025-05-07T20:32:37.7907261Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.7907433Z 2025-05-07T20:32:37.7907534Z moe/activation_test.py:117: 2025-05-07T20:32:37.7907843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.7908177Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.7908466Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.7909095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.7909719Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.7910396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.7911082Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.7911617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.7912299Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.7912964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.7913547Z kernel = self.compile( 2025-05-07T20:32:37.7914084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.7914739Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.7915144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.7915375Z 2025-05-07T20:32:37.7915587Z self = 2025-05-07T20:32:37.7916657Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.7918027Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdae2c940>} 2025-05-07T20:32:37.7919370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.7920385Z context = 2025-05-07T20:32:37.7920676Z 2025-05-07T20:32:37.7920844Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.7921362Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.7921825Z module_map=module_map) 2025-05-07T20:32:37.7922191Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.7922543Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.7922804Z E ^ 2025-05-07T20:32:37.7923267Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.7923760Z 2025-05-07T20:32:37.7924183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.7924705Z 2025-05-07T20:32:37.7924808Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.7925226Z self=, 2025-05-07T20:32:37.7925629Z T=1, 2025-05-07T20:32:37.7925808Z D=7168, 2025-05-07T20:32:37.7926001Z scale_ub=1200.0, 2025-05-07T20:32:37.7926225Z contiguous=False, 2025-05-07T20:32:37.7926449Z compiled=True, 2025-05-07T20:32:37.7926654Z ) 2025-05-07T20:32:38.1456442Z self = 2025-05-07T20:32:38.1457011Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:38.1457291Z 2025-05-07T20:32:38.1457375Z @given( 2025-05-07T20:32:38.1457624Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.1457964Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.1458292Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.1458642Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.1458995Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.1459422Z ) 2025-05-07T20:32:38.1459849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.1460317Z def test_silu_mul_quant( 2025-05-07T20:32:38.1460568Z self, 2025-05-07T20:32:38.1460782Z T: int, 2025-05-07T20:32:38.1460993Z D: int, 2025-05-07T20:32:38.1461317Z scale_ub: Optional[float], 2025-05-07T20:32:38.1461605Z contiguous: bool, 2025-05-07T20:32:38.1461861Z compiled: bool, 2025-05-07T20:32:38.1462091Z ) -> None: 2025-05-07T20:32:38.1462325Z torch.manual_seed(2025) 2025-05-07T20:32:38.1462579Z 2025-05-07T20:32:38.1462859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.1463286Z 2025-05-07T20:32:38.1463486Z x_sign = torch.sign(x) 2025-05-07T20:32:38.1463784Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.1464096Z x = x_sign * x_clamp 2025-05-07T20:32:38.1464346Z x0 = x[:, :D] 2025-05-07T20:32:38.1464578Z x1 = x[:, D:] 2025-05-07T20:32:38.1464791Z 2025-05-07T20:32:38.1464983Z if contiguous: 2025-05-07T20:32:38.1465225Z x0 = x0.contiguous() 2025-05-07T20:32:38.1465487Z x1 = x1.contiguous() 2025-05-07T20:32:38.1465737Z 2025-05-07T20:32:38.1465941Z if scale_ub is not None: 2025-05-07T20:32:38.1466217Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.1466563Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.1466877Z ) 2025-05-07T20:32:38.1467070Z else: 2025-05-07T20:32:38.1467285Z scale_ub_tensor = None 2025-05-07T20:32:38.1467549Z 2025-05-07T20:32:38.1467788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.1468111Z op = silu_mul_quant 2025-05-07T20:32:38.1468369Z if compiled: 2025-05-07T20:32:38.1468627Z op = torch.compile(op) 2025-05-07T20:32:38.1468930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.1469238Z 2025-05-07T20:32:38.1469431Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.1469605Z 2025-05-07T20:32:38.1469712Z moe/activation_test.py:117: 2025-05-07T20:32:38.1470017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.1470358Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.1470646Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.1471217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.1471784Z return fn(*args, **kwargs) 
2025-05-07T20:32:38.1472518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.1473210Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.1473754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.1474444Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.1475115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.1475652Z kernel = self.compile( 2025-05-07T20:32:38.1476200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.1476852Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.1477256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.1477497Z 2025-05-07T20:32:38.1477712Z self = 2025-05-07T20:32:38.1478804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.1480261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9ba55e0>} 2025-05-07T20:32:38.1481610Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.1482630Z context = 2025-05-07T20:32:38.1482925Z 2025-05-07T20:32:38.1483104Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.1483685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.1484152Z module_map=module_map) 2025-05-07T20:32:38.1484531Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.1484888Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.1485158Z E ^ 2025-05-07T20:32:38.1485625Z E ValueError("type fp8e4nv not supported in this architecture. 
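The failure is environment-level rather than data-dependent: every drawn example dies while compiling the kernel, before any numerics run. Triton lowers torch.float8_e4m3fn to its fp8e4nv type, and (as of the Triton release in this environment) fp8e4nv is only emitted for NVIDIA GPUs with compute capability 8.9 or newer; older architectures expose only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal sketch of a guard, assuming pytest-style collection (the helper and marker names here are hypothetical, not FBGEMM's):

    import pytest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton only compiles fp8e4nv (float8_e4m3fn) for SM 8.9+ GPUs.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker; applying it to test_silu_mul_quant would skip
    # instead of failing every Hypothesis example on unsupported GPUs.
    requires_fp8e4nv = pytest.mark.skipif(
        not _supports_fp8e4nv(),
        reason="Triton supports only fp8e4b15/fp8e5 on this architecture",
    )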
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

This example gets past the op under test: fn() returns, the output is dequantized, and the failure moves into the reference path, where triton_quantize_fp8_row JIT-compiles its own fp8 kernel inside the autotuner's benchmarking loop and hits the same architecture check:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: type fp8e4nv not supported (in _fbgemm_silu_mul_quant, as above)
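Note that ref_fn is not an independent oracle on this machine: triton_quantize_fp8_row also JIT-compiles an fp8 Triton kernel, so the reference fails for the same reason as the op under test. A device-agnostic rowwise quantization could be sketched in plain PyTorch against the dequant contract the test uses (y ~= y_fp8.to(torch.float32) * y_scale[:, None]); this is an illustration only: the 448.0 bound is the finite max of float8_e4m3fn, and the scale_ub handling is inferred from the argument name rather than taken from FBGEMM's kernel:

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Rowwise scale so that y ~= y_fp8.to(torch.float32) * y_scale[:, None].
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        # Clamp away zero rows to avoid division by zero.
        y_scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (
            (y.to(torch.float32) / y_scale[:, None])
            .clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
            .to(torch.float8_e4m3fn)
        )
        return y_fp8, y_scale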
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.4706482Z 2025-05-07T20:32:38.4706910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.4707548Z 2025-05-07T20:32:38.4707659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.4708098Z self=, 2025-05-07T20:32:38.4708514Z T=1, 2025-05-07T20:32:38.4708707Z D=5120, 2025-05-07T20:32:38.4708921Z scale_ub=1200.0, 2025-05-07T20:32:38.4709165Z contiguous=False, 2025-05-07T20:32:38.4709404Z compiled=False, 2025-05-07T20:32:38.4709631Z ) 2025-05-07T20:32:38.4709970Z self = 2025-05-07T20:32:38.4710470Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:38.4710808Z 2025-05-07T20:32:38.4710894Z @given( 2025-05-07T20:32:38.4711142Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.4711472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.4711791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.4712149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.4712506Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.4712802Z ) 2025-05-07T20:32:38.4713171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.4713637Z def test_silu_mul_quant( 2025-05-07T20:32:38.4713891Z self, 2025-05-07T20:32:38.4714106Z T: int, 2025-05-07T20:32:38.4714326Z D: int, 2025-05-07T20:32:38.4714568Z scale_ub: Optional[float], 2025-05-07T20:32:38.4714850Z contiguous: bool, 2025-05-07T20:32:38.4715110Z compiled: bool, 2025-05-07T20:32:38.4715355Z ) -> None: 2025-05-07T20:32:38.4715592Z torch.manual_seed(2025) 2025-05-07T20:32:38.4715857Z 2025-05-07T20:32:38.4716149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.4716503Z 2025-05-07T20:32:38.4716715Z x_sign = torch.sign(x) 2025-05-07T20:32:38.4717030Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.4717360Z x = x_sign * x_clamp 2025-05-07T20:32:38.4717622Z x0 = x[:, :D] 2025-05-07T20:32:38.4717856Z x1 = x[:, D:] 2025-05-07T20:32:38.4718071Z 2025-05-07T20:32:38.4718296Z if contiguous: 2025-05-07T20:32:38.4718545Z x0 = x0.contiguous() 2025-05-07T20:32:38.4718819Z x1 = x1.contiguous() 2025-05-07T20:32:38.4719070Z 2025-05-07T20:32:38.4719280Z if scale_ub is not None: 2025-05-07T20:32:38.4719574Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.4719922Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.4720262Z ) 2025-05-07T20:32:38.4720523Z else: 2025-05-07T20:32:38.4720746Z scale_ub_tensor = None 2025-05-07T20:32:38.4721023Z 2025-05-07T20:32:38.4721277Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.4721605Z op = silu_mul_quant 2025-05-07T20:32:38.4721883Z if compiled: 2025-05-07T20:32:38.4722158Z op = torch.compile(op) 2025-05-07T20:32:38.4722466Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.4722770Z 2025-05-07T20:32:38.4722984Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.4723158Z 2025-05-07T20:32:38.4723274Z moe/activation_test.py:117: 2025-05-07T20:32:38.4723580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.4723936Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.4724237Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.4724950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.4725670Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.4726231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.4726975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.4727689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.4728246Z kernel = self.compile( 2025-05-07T20:32:38.4728812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.4729486Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.4729909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.4730159Z 2025-05-07T20:32:38.4730379Z self = 2025-05-07T20:32:38.4731544Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.4732946Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9a30550>} 2025-05-07T20:32:38.4734302Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.4735337Z context = 2025-05-07T20:32:38.4735643Z 2025-05-07T20:32:38.4735822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.4736365Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.4736847Z module_map=module_map) 2025-05-07T20:32:38.4737236Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.4737617Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.4737889Z E ^ 2025-05-07T20:32:38.4738373Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.4738835Z 2025-05-07T20:32:38.4739273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.4739792Z 2025-05-07T20:32:38.4739901Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.4740634Z self=, 2025-05-07T20:32:38.4741110Z T=16384, 2025-05-07T20:32:38.4741312Z D=5120, 2025-05-07T20:32:38.4741521Z scale_ub=1200.0, 2025-05-07T20:32:38.4741849Z contiguous=False, 2025-05-07T20:32:38.4742084Z compiled=True, 2025-05-07T20:32:38.4742302Z ) 2025-05-07T20:32:38.5914223Z self = 2025-05-07T20:32:38.5914765Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:38.5915078Z 2025-05-07T20:32:38.5915162Z @given( 2025-05-07T20:32:38.5915414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.5915745Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.5916073Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.5916418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.5916766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.5917070Z ) 2025-05-07T20:32:38.5917429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.5917887Z def test_silu_mul_quant( 2025-05-07T20:32:38.5918159Z self, 2025-05-07T20:32:38.5918363Z T: int, 2025-05-07T20:32:38.5918574Z D: int, 2025-05-07T20:32:38.5918809Z scale_ub: Optional[float], 2025-05-07T20:32:38.5919091Z contiguous: bool, 2025-05-07T20:32:38.5919347Z compiled: bool, 2025-05-07T20:32:38.5919591Z ) -> None: 2025-05-07T20:32:38.5920138Z torch.manual_seed(2025) 2025-05-07T20:32:38.5920404Z 2025-05-07T20:32:38.5920696Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.5921067Z 2025-05-07T20:32:38.5921276Z x_sign = torch.sign(x) 2025-05-07T20:32:38.5921587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.5921922Z x = x_sign * x_clamp 2025-05-07T20:32:38.5922173Z x0 = x[:, :D] 2025-05-07T20:32:38.5922405Z x1 = x[:, D:] 2025-05-07T20:32:38.5922631Z 2025-05-07T20:32:38.5922826Z if contiguous: 2025-05-07T20:32:38.5923075Z x0 = x0.contiguous() 2025-05-07T20:32:38.5923442Z x1 = x1.contiguous() 2025-05-07T20:32:38.5923694Z 2025-05-07T20:32:38.5923905Z if scale_ub is not None: 2025-05-07T20:32:38.5924202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.5924550Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.5924881Z ) 2025-05-07T20:32:38.5925098Z else: 2025-05-07T20:32:38.5925319Z scale_ub_tensor = None 2025-05-07T20:32:38.5925589Z 2025-05-07T20:32:38.5925839Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.5926166Z op = silu_mul_quant 2025-05-07T20:32:38.5926437Z if compiled: 2025-05-07T20:32:38.5926705Z op = torch.compile(op) 2025-05-07T20:32:38.5927025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.5927310Z 2025-05-07T20:32:38.5927518Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.5927689Z 2025-05-07T20:32:38.5927805Z moe/activation_test.py:117: 2025-05-07T20:32:38.5928115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.5928470Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.5928797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.5929400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.5929988Z return fn(*args, **kwargs) 
2025-05-07T20:32:38.5930663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.5931361Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.5931913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.5932596Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.5933367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.5933927Z kernel = self.compile( 2025-05-07T20:32:38.5934478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.5935154Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.5935576Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.5935815Z 2025-05-07T20:32:38.5936039Z self = 2025-05-07T20:32:38.5937134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.5938526Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9ae41f0>} 2025-05-07T20:32:38.5939883Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.5941453Z context = 2025-05-07T20:32:38.5941799Z 2025-05-07T20:32:38.5941977Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.5942521Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.5943007Z module_map=module_map) 2025-05-07T20:32:38.5943394Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.5943756Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.5944033Z E ^ 2025-05-07T20:32:38.5944516Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.5945029Z 2025-05-07T20:32:38.5945448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.5945974Z 2025-05-07T20:32:38.5946084Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.5946519Z self=, 2025-05-07T20:32:38.5946936Z T=2048, 2025-05-07T20:32:38.5947134Z D=7168, 2025-05-07T20:32:38.5947343Z scale_ub=1200.0, 2025-05-07T20:32:38.5947586Z contiguous=False, 2025-05-07T20:32:38.5947820Z compiled=True, 2025-05-07T20:32:38.5948041Z ) 2025-05-07T20:32:38.5948374Z self = 2025-05-07T20:32:38.5948875Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:38.5949167Z 2025-05-07T20:32:38.5949250Z @given( 2025-05-07T20:32:38.5949497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.5949832Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.5950146Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.5950496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.5950843Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.5951139Z ) 2025-05-07T20:32:38.5951503Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.5951961Z def test_silu_mul_quant( 2025-05-07T20:32:38.5952211Z self, 2025-05-07T20:32:38.5952426Z T: int, 2025-05-07T20:32:38.5952639Z D: int, 2025-05-07T20:32:38.5952869Z scale_ub: Optional[float], 2025-05-07T20:32:38.5953158Z contiguous: bool, 2025-05-07T20:32:38.5953415Z compiled: bool, 2025-05-07T20:32:38.5953646Z ) -> None: 2025-05-07T20:32:38.5953884Z torch.manual_seed(2025) 2025-05-07T20:32:38.5954148Z 2025-05-07T20:32:38.5954501Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.5954859Z 2025-05-07T20:32:38.5955073Z x_sign = torch.sign(x) 2025-05-07T20:32:38.5955383Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.5955703Z x = x_sign * x_clamp 2025-05-07T20:32:38.5955956Z x0 = x[:, :D] 2025-05-07T20:32:38.5956193Z x1 = x[:, D:] 2025-05-07T20:32:38.5956406Z 2025-05-07T20:32:38.5956603Z if contiguous: 2025-05-07T20:32:38.5956847Z x0 = x0.contiguous() 2025-05-07T20:32:38.5957114Z x1 = x1.contiguous() 2025-05-07T20:32:38.5957369Z 2025-05-07T20:32:38.5957575Z if scale_ub is not None: 2025-05-07T20:32:38.5957858Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.5958208Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.5958532Z ) 2025-05-07T20:32:38.5958731Z else: 2025-05-07T20:32:38.5958955Z scale_ub_tensor = None 2025-05-07T20:32:38.5959228Z 2025-05-07T20:32:38.5959473Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.5959803Z op = silu_mul_quant 2025-05-07T20:32:38.5960069Z if compiled: 2025-05-07T20:32:38.5960331Z op = torch.compile(op) 2025-05-07T20:32:38.5960631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.5961029Z 2025-05-07T20:32:38.5961237Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.5961410Z 2025-05-07T20:32:38.5961514Z moe/activation_test.py:117: 2025-05-07T20:32:38.5961826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.5962171Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.5962461Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.5963026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.5963591Z return fn(*args, **kwargs) 
2025-05-07T20:32:38.5964263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.5965005Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.5965553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.5966246Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.5966910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.5967452Z kernel = self.compile( 2025-05-07T20:32:38.5968004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.5968666Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.5969071Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.5969313Z 2025-05-07T20:32:38.5969567Z self = 2025-05-07T20:32:38.5970670Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.5972050Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9ae4ee0>} 2025-05-07T20:32:38.5973397Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.5974416Z context = 2025-05-07T20:32:38.5974718Z 2025-05-07T20:32:38.5974890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.5975483Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.5975955Z module_map=module_map) 2025-05-07T20:32:38.5976334Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.5976704Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.5976975Z E ^ 2025-05-07T20:32:38.5977440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.5977898Z 2025-05-07T20:32:38.5978319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.5978833Z 2025-05-07T20:32:38.8662913Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.8663434Z self=, 2025-05-07T20:32:38.8663854Z T=1, 2025-05-07T20:32:38.8664049Z D=5120, 2025-05-07T20:32:38.8664283Z scale_ub=None, 2025-05-07T20:32:38.8664514Z contiguous=False, 2025-05-07T20:32:38.8664746Z compiled=False, 2025-05-07T20:32:38.8664970Z ) 2025-05-07T20:32:38.8665301Z self = 2025-05-07T20:32:38.8666084Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:38.8666440Z 2025-05-07T20:32:38.8666524Z @given( 2025-05-07T20:32:38.8666764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.8667087Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.8667411Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.8667755Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.8668106Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.8668401Z ) 2025-05-07T20:32:38.8668766Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.8669320Z def test_silu_mul_quant( 2025-05-07T20:32:38.8669609Z self, 2025-05-07T20:32:38.8669827Z T: int, 2025-05-07T20:32:38.8670040Z D: int, 2025-05-07T20:32:38.8670264Z scale_ub: Optional[float], 2025-05-07T20:32:38.8670547Z contiguous: bool, 2025-05-07T20:32:38.8670798Z compiled: bool, 2025-05-07T20:32:38.8671038Z ) -> None: 2025-05-07T20:32:38.8671269Z torch.manual_seed(2025) 2025-05-07T20:32:38.8671523Z 2025-05-07T20:32:38.8671801Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.8672157Z 2025-05-07T20:32:38.8672363Z x_sign = torch.sign(x) 2025-05-07T20:32:38.8672662Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.8672991Z x = x_sign * x_clamp 2025-05-07T20:32:38.8673244Z x0 = x[:, :D] 2025-05-07T20:32:38.8673473Z x1 = x[:, D:] 2025-05-07T20:32:38.8673686Z 2025-05-07T20:32:38.8673882Z if contiguous: 2025-05-07T20:32:38.8674139Z x0 = x0.contiguous() 2025-05-07T20:32:38.8674409Z x1 = x1.contiguous() 2025-05-07T20:32:38.8674665Z 2025-05-07T20:32:38.8674875Z if scale_ub is not None: 2025-05-07T20:32:38.8675158Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.8675511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.8675839Z ) 2025-05-07T20:32:38.8676040Z else: 2025-05-07T20:32:38.8676262Z scale_ub_tensor = None 2025-05-07T20:32:38.8676528Z 2025-05-07T20:32:38.8676765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.8677105Z op = silu_mul_quant 2025-05-07T20:32:38.8677374Z if compiled: 2025-05-07T20:32:38.8677630Z op = torch.compile(op) 2025-05-07T20:32:38.8677941Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.8678231Z 2025-05-07T20:32:38.8678430Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.8678610Z 2025-05-07T20:32:38.8678804Z moe/activation_test.py:117: 2025-05-07T20:32:38.8679119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.8679465Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.8679751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.8680460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.8681174Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.8681720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.8682415Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.8683092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.8683635Z kernel = self.compile( 2025-05-07T20:32:38.8684188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.8684857Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.8685267Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.8685504Z 2025-05-07T20:32:38.8685807Z self = 2025-05-07T20:32:38.8686897Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.8688300Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd98475e0>} 2025-05-07T20:32:38.8689670Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.8690739Z context = 2025-05-07T20:32:38.8691034Z 2025-05-07T20:32:38.8691213Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.8691750Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.8692225Z module_map=module_map) 2025-05-07T20:32:38.8692606Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.8692964Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.8693240Z E ^ 2025-05-07T20:32:38.8693714Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.8694165Z 2025-05-07T20:32:38.8694605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.8695124Z 2025-05-07T20:32:38.8695234Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.8695660Z self=, 2025-05-07T20:32:38.8696076Z T=4096, 2025-05-07T20:32:38.8696275Z D=7168, 2025-05-07T20:32:38.8696485Z scale_ub=1200.0, 2025-05-07T20:32:38.8696728Z contiguous=False, 2025-05-07T20:32:38.8696960Z compiled=False, 2025-05-07T20:32:38.8697179Z ) 2025-05-07T20:32:38.8697505Z self = 2025-05-07T20:32:38.8705334Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:38.8705631Z 2025-05-07T20:32:38.8705717Z @given( 2025-05-07T20:32:38.8705970Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.8706308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.8706623Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.8707064Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.8707422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.8707717Z ) 2025-05-07T20:32:38.8708089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.8708564Z def test_silu_mul_quant( 2025-05-07T20:32:38.8708817Z self, 2025-05-07T20:32:38.8709031Z T: int, 2025-05-07T20:32:38.8709243Z D: int, 2025-05-07T20:32:38.8709477Z scale_ub: Optional[float], 2025-05-07T20:32:38.8709754Z contiguous: bool, 2025-05-07T20:32:38.8710013Z compiled: bool, 2025-05-07T20:32:38.8710254Z ) -> None: 2025-05-07T20:32:38.8710486Z torch.manual_seed(2025) 2025-05-07T20:32:38.8710743Z 2025-05-07T20:32:38.8711034Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.8711383Z 2025-05-07T20:32:38.8711591Z x_sign = torch.sign(x) 2025-05-07T20:32:38.8711907Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.8712234Z x = x_sign * x_clamp 2025-05-07T20:32:38.8712495Z x0 = x[:, :D] 2025-05-07T20:32:38.8712729Z x1 = x[:, D:] 2025-05-07T20:32:38.8712947Z 2025-05-07T20:32:38.8713152Z if contiguous: 2025-05-07T20:32:38.8713485Z x0 = x0.contiguous() 2025-05-07T20:32:38.8713752Z x1 = x1.contiguous() 2025-05-07T20:32:38.8714015Z 2025-05-07T20:32:38.8714224Z if scale_ub is not None: 2025-05-07T20:32:38.8714510Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.8714863Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.8715193Z ) 2025-05-07T20:32:38.8715405Z else: 2025-05-07T20:32:38.8715625Z scale_ub_tensor = None 2025-05-07T20:32:38.8715895Z 2025-05-07T20:32:38.8716145Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.8716470Z op = silu_mul_quant 2025-05-07T20:32:38.8716793Z if compiled: 2025-05-07T20:32:38.8717060Z op = torch.compile(op) 2025-05-07T20:32:38.8717367Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.8717659Z 2025-05-07T20:32:38.8717868Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.8718041Z 2025-05-07T20:32:38.8718149Z moe/activation_test.py:117: 2025-05-07T20:32:38.8718466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.8718818Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.8719117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.8719814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.8720517Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.8721069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.8721757Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.8722464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.8723009Z kernel = self.compile( 2025-05-07T20:32:38.8723568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.8724243Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.8724654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.8724900Z 2025-05-07T20:32:38.8725116Z self = 2025-05-07T20:32:38.8726286Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.8727664Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd97161f0>} 2025-05-07T20:32:38.8729026Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.8730119Z context = 2025-05-07T20:32:38.8730412Z 2025-05-07T20:32:38.8730590Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.8731125Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.8731592Z module_map=module_map) 2025-05-07T20:32:38.8731971Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.8732344Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.8732609Z E ^ 2025-05-07T20:32:38.8733078Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.8733995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:38.8734669Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) failed with the same test body, traceback, and CompilationError as the example above.
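Every sampled example fails at the same point: Triton rejects the fp8e4nv dtype while lowering _fbgemm_silu_mul_quant. fp8e4nv corresponds to torch.float8_e4m3fn, which Triton only emits for CUDA devices of compute capability 8.9 (Ada) or newer; older GPUs expose only the ('fp8e4b15', 'fp8e5') pair named in the error, so this looks like a hardware-capability mismatch rather than a kernel bug. Below is a minimal sketch of a capability guard that could skip these cases on such devices; the helper name and the skip decorator are illustrative, not the FBGEMM test suite's actual mechanism.

    import torch

    def device_supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to torch.float8_e4m3fn and is only emitted for
        # CUDA devices with compute capability >= 8.9; older GPUs expose the
        # ('fp8e4b15', 'fp8e5') support set named in the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # import unittest
    # @unittest.skipUnless(device_supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(self, ...) -> None: ...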
2025-05-07T20:32:39.1716332Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) failed with the same CompilationError.
2025-05-07T20:32:39.3711025Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) failed with the same CompilationError.
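The repetition itself comes from Hypothesis, not from the kernel: @settings(verbosity=Verbosity.verbose) makes Hypothesis echo every generated example before running it, so a single environment-level failure is reported once per sampled (T, D, scale_ub, contiguous, compiled) tuple. A standalone sketch of that logging behavior, separate from the FBGEMM test:

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=6, deadline=None)
    def test_logs_every_example(T: int, compiled: bool) -> None:
        # Under Verbosity.verbose, Hypothesis prints a "Trying example: ..."
        # line with the sampled arguments before running this body, which is
        # exactly where the repeated blocks above come from.
        assert T >= 1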
2025-05-07T20:32:39.3743492Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) failed with the same CompilationError.
2025-05-07T20:32:39.6536857Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) failed with the same CompilationError.
2025-05-07T20:32:39.6570431Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) failed with the same CompilationError.
2025-05-07T20:32:39.6601929Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) failed with the same CompilationError.
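Note that the compiled=True examples fail exactly like the compiled=False ones: torch.compile only adds the torch/_dynamo/eval_frame.py frame before the same _fbgemm_silu_mul_quant launch, so the dtype check still happens inside Triton's compiler on both paths. A minimal sketch of that dispatch pattern, with a stand-in op in place of the real fused kernel:

    from typing import Callable

    import torch

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Stand-in for the fused kernel under test: SiLU(x0) * x1, without
        # the fp8 quantization step.
        return torch.nn.functional.silu(x0) * x1

    def run(op: Callable[..., torch.Tensor], compiled: bool,
            *args: torch.Tensor) -> torch.Tensor:
        # Mirrors fn() in the test: the same op runs either eagerly or through
        # torch.compile, so a backend-unsupported dtype fails either way.
        if compiled:
            op = torch.compile(op)
        return op(*args)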
2025-05-07T20:32:39.8557047Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) failed with the same CompilationError.
2025-05-07T20:32:39.8588638Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) failed with the same CompilationError.
2025-05-07T20:32:40.1645683Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) failed with the same CompilationError.
2025-05-07T20:32:40.3748667Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) failed with the same traceback, ending in:
2025-05-07T20:32:40.3778583Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.3778937Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:40.3779207Z E   ^
2025-05-07T20:32:40.3779730Z E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3780190Z 2025-05-07T20:32:40.3780607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3781240Z 2025-05-07T20:32:40.3781351Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3781775Z self=, 2025-05-07T20:32:40.3782188Z T=16384, 2025-05-07T20:32:40.3782384Z D=5120, 2025-05-07T20:32:40.3782586Z scale_ub=1200.0, 2025-05-07T20:32:40.3782818Z contiguous=True, 2025-05-07T20:32:40.3783043Z compiled=True, 2025-05-07T20:32:40.3783257Z ) 2025-05-07T20:32:40.3783582Z self = 2025-05-07T20:32:40.3784082Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.3784365Z 2025-05-07T20:32:40.3784446Z @given( 2025-05-07T20:32:40.3784690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.3785014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.3785324Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.3785663Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.3786003Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.3786382Z ) 2025-05-07T20:32:40.3786740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.3787194Z def test_silu_mul_quant( 2025-05-07T20:32:40.3787436Z self, 2025-05-07T20:32:40.3787636Z T: int, 2025-05-07T20:32:40.3787842Z D: int, 2025-05-07T20:32:40.3788061Z scale_ub: Optional[float], 2025-05-07T20:32:40.3788340Z contiguous: bool, 2025-05-07T20:32:40.3788586Z compiled: bool, 2025-05-07T20:32:40.3788810Z ) -> None: 2025-05-07T20:32:40.3789033Z torch.manual_seed(2025) 2025-05-07T20:32:40.3789285Z 2025-05-07T20:32:40.3789570Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.3789967Z 2025-05-07T20:32:40.3790169Z x_sign = torch.sign(x) 2025-05-07T20:32:40.3790471Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.3790786Z x = x_sign * x_clamp 2025-05-07T20:32:40.3791032Z x0 = x[:, :D] 2025-05-07T20:32:40.3791258Z x1 = x[:, D:] 2025-05-07T20:32:40.3791467Z 2025-05-07T20:32:40.3791660Z if contiguous: 2025-05-07T20:32:40.3791904Z x0 = x0.contiguous() 2025-05-07T20:32:40.3792168Z x1 = x1.contiguous() 2025-05-07T20:32:40.3792417Z 2025-05-07T20:32:40.3792616Z if scale_ub is not None: 2025-05-07T20:32:40.3792892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.3793242Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.3793562Z ) 2025-05-07T20:32:40.3793759Z else: 2025-05-07T20:32:40.3793980Z scale_ub_tensor = None 2025-05-07T20:32:40.3794249Z 2025-05-07T20:32:40.3794486Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.3794814Z op = silu_mul_quant 2025-05-07T20:32:40.3795082Z if compiled: 2025-05-07T20:32:40.3795344Z op = torch.compile(op) 2025-05-07T20:32:40.3795647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3795939Z 2025-05-07T20:32:40.3796142Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.3796311Z 2025-05-07T20:32:40.3796414Z moe/activation_test.py:117: 2025-05-07T20:32:40.3796719Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3797062Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.3797350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3797916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.3798481Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.3799199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.3799932Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.3800495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.3801188Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.3801853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.3802399Z kernel = self.compile( 2025-05-07T20:32:40.3802951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.3803615Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3804012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3804256Z 2025-05-07T20:32:40.3804475Z self = 2025-05-07T20:32:40.3805605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.3807035Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd932a550>} 2025-05-07T20:32:40.3808380Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.3809414Z context = 2025-05-07T20:32:40.3809718Z 2025-05-07T20:32:40.3809893Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.3810473Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3810946Z module_map=module_map) 2025-05-07T20:32:40.3811324Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3811695Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.3811966Z E ^ 2025-05-07T20:32:40.3812427Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3812884Z 2025-05-07T20:32:40.3813306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3813817Z 2025-05-07T20:32:40.6041655Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6042326Z self=, 2025-05-07T20:32:40.6042909Z T=16384, 2025-05-07T20:32:40.6043210Z D=5120, 2025-05-07T20:32:40.6043431Z scale_ub=None, 2025-05-07T20:32:40.6043663Z contiguous=False, 2025-05-07T20:32:40.6043901Z compiled=True, 2025-05-07T20:32:40.6044125Z ) 2025-05-07T20:32:40.6044458Z self = 2025-05-07T20:32:40.6044970Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.6045255Z 2025-05-07T20:32:40.6045339Z @given( 2025-05-07T20:32:40.6045578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6045899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6046208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6046548Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6046887Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6047176Z ) 2025-05-07T20:32:40.6047536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6048274Z def test_silu_mul_quant( 2025-05-07T20:32:40.6048524Z self, 2025-05-07T20:32:40.6048730Z T: int, 2025-05-07T20:32:40.6048934Z D: int, 2025-05-07T20:32:40.6049155Z scale_ub: Optional[float], 2025-05-07T20:32:40.6049438Z contiguous: bool, 2025-05-07T20:32:40.6049690Z compiled: bool, 2025-05-07T20:32:40.6049925Z ) -> None: 2025-05-07T20:32:40.6050144Z torch.manual_seed(2025) 2025-05-07T20:32:40.6050395Z 2025-05-07T20:32:40.6050672Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6051019Z 2025-05-07T20:32:40.6051215Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6051513Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6051827Z x = x_sign * x_clamp 2025-05-07T20:32:40.6052075Z x0 = x[:, :D] 2025-05-07T20:32:40.6052302Z x1 = x[:, D:] 2025-05-07T20:32:40.6052511Z 2025-05-07T20:32:40.6052701Z if contiguous: 2025-05-07T20:32:40.6052949Z x0 = x0.contiguous() 2025-05-07T20:32:40.6053209Z x1 = x1.contiguous() 2025-05-07T20:32:40.6053459Z 2025-05-07T20:32:40.6053656Z if scale_ub is not None: 2025-05-07T20:32:40.6053931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6054348Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6054736Z ) 2025-05-07T20:32:40.6054930Z else: 2025-05-07T20:32:40.6055154Z scale_ub_tensor = None 2025-05-07T20:32:40.6055415Z 2025-05-07T20:32:40.6055654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6055972Z op = silu_mul_quant 2025-05-07T20:32:40.6056234Z if compiled: 2025-05-07T20:32:40.6056494Z op = torch.compile(op) 2025-05-07T20:32:40.6056791Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6057081Z 2025-05-07T20:32:40.6064985Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6065313Z 2025-05-07T20:32:40.6065432Z moe/activation_test.py:117: 2025-05-07T20:32:40.6065746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6066090Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6066395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6066983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6067545Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6068213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6068906Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6069460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6070145Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6070827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6071380Z kernel = self.compile( 2025-05-07T20:32:40.6071930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6072609Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6073018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6073259Z 2025-05-07T20:32:40.6073480Z self = 2025-05-07T20:32:40.6074576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6076005Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd92d21f0>} 2025-05-07T20:32:40.6077379Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6078405Z context = 2025-05-07T20:32:40.6078697Z 2025-05-07T20:32:40.6078876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6079400Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6079884Z module_map=module_map) 2025-05-07T20:32:40.6080261Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6080620Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6080896Z E ^ 2025-05-07T20:32:40.6081369Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6081881Z 2025-05-07T20:32:40.6082415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6082941Z 2025-05-07T20:32:40.6083172Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6083603Z self=, 2025-05-07T20:32:40.6084022Z T=2048, 2025-05-07T20:32:40.6084213Z D=5120, 2025-05-07T20:32:40.6084416Z scale_ub=None, 2025-05-07T20:32:40.6084647Z contiguous=False, 2025-05-07T20:32:40.6084887Z compiled=True, 2025-05-07T20:32:40.6085094Z ) 2025-05-07T20:32:40.7285013Z self = 2025-05-07T20:32:40.7286523Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.7287309Z 2025-05-07T20:32:40.7287533Z @given( 2025-05-07T20:32:40.7288580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.7289215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.7289827Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.7290177Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.7290520Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.7290819Z ) 2025-05-07T20:32:40.7291182Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.7291645Z def test_silu_mul_quant( 2025-05-07T20:32:40.7291897Z self, 2025-05-07T20:32:40.7292111Z T: int, 2025-05-07T20:32:40.7292324Z D: int, 2025-05-07T20:32:40.7292551Z scale_ub: Optional[float], 2025-05-07T20:32:40.7292840Z contiguous: bool, 2025-05-07T20:32:40.7293095Z compiled: bool, 2025-05-07T20:32:40.7293331Z ) -> None: 2025-05-07T20:32:40.7293564Z torch.manual_seed(2025) 2025-05-07T20:32:40.7293826Z 2025-05-07T20:32:40.7294108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.7294466Z 2025-05-07T20:32:40.7294675Z x_sign = torch.sign(x) 2025-05-07T20:32:40.7294974Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.7295304Z x = x_sign * x_clamp 2025-05-07T20:32:40.7295567Z x0 = x[:, :D] 2025-05-07T20:32:40.7295789Z x1 = x[:, D:] 2025-05-07T20:32:40.7296011Z 2025-05-07T20:32:40.7296212Z if contiguous: 2025-05-07T20:32:40.7296455Z x0 = x0.contiguous() 2025-05-07T20:32:40.7296731Z x1 = x1.contiguous() 2025-05-07T20:32:40.7296986Z 2025-05-07T20:32:40.7297185Z if scale_ub is not None: 2025-05-07T20:32:40.7297475Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.7297829Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.7298152Z ) 2025-05-07T20:32:40.7298355Z else: 2025-05-07T20:32:40.7298680Z scale_ub_tensor = None 2025-05-07T20:32:40.7298949Z 2025-05-07T20:32:40.7299187Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.7299516Z op = silu_mul_quant 2025-05-07T20:32:40.7299784Z if compiled: 2025-05-07T20:32:40.7300074Z op = torch.compile(op) 2025-05-07T20:32:40.7300410Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.7300700Z 2025-05-07T20:32:40.7300900Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.7301191Z 2025-05-07T20:32:40.7301297Z moe/activation_test.py:117: 2025-05-07T20:32:40.7301607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.7301959Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.7302248Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.7302824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.7303406Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.7304075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.7304767Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.7305397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.7306162Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.7306829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.7307378Z kernel = self.compile( 2025-05-07T20:32:40.7307942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.7308603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.7309023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.7309313Z 2025-05-07T20:32:40.7309530Z self = 2025-05-07T20:32:40.7310623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.7312016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd92d2f70>} 2025-05-07T20:32:40.7313368Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.7314396Z context = 2025-05-07T20:32:40.7314690Z 2025-05-07T20:32:40.7314880Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.7315425Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.7315894Z module_map=module_map) 2025-05-07T20:32:40.7316280Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.7316660Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.7316927Z E ^ 2025-05-07T20:32:40.7317413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.7317862Z 2025-05-07T20:32:40.7318293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.7318809Z 2025-05-07T20:32:40.7318918Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.7319341Z self=, 2025-05-07T20:32:40.7319847Z T=2048, 2025-05-07T20:32:40.7320045Z D=5120, 2025-05-07T20:32:40.7320249Z scale_ub=1200.0, 2025-05-07T20:32:40.7320487Z contiguous=False, 2025-05-07T20:32:40.7320722Z compiled=True, 2025-05-07T20:32:40.7320943Z ) 2025-05-07T20:32:40.7321276Z self = 2025-05-07T20:32:40.7321793Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:40.7322072Z 2025-05-07T20:32:40.7322157Z @given( 2025-05-07T20:32:40.7322400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.7322727Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.7323040Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.7323383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.7323732Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.7324022Z ) 2025-05-07T20:32:40.7324388Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.7324843Z def test_silu_mul_quant( 2025-05-07T20:32:40.7325088Z self, 2025-05-07T20:32:40.7325294Z T: int, 2025-05-07T20:32:40.7325502Z D: int, 2025-05-07T20:32:40.7325732Z scale_ub: Optional[float], 2025-05-07T20:32:40.7326056Z contiguous: bool, 2025-05-07T20:32:40.7326345Z compiled: bool, 2025-05-07T20:32:40.7326580Z ) -> None: 2025-05-07T20:32:40.7326805Z torch.manual_seed(2025) 2025-05-07T20:32:40.7327061Z 2025-05-07T20:32:40.7327343Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.7327690Z 2025-05-07T20:32:40.7327896Z x_sign = torch.sign(x) 2025-05-07T20:32:40.7328200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.7328516Z x = x_sign * x_clamp 2025-05-07T20:32:40.7328770Z x0 = x[:, :D] 2025-05-07T20:32:40.7329000Z x1 = x[:, D:] 2025-05-07T20:32:40.7329256Z 2025-05-07T20:32:40.7329458Z if contiguous: 2025-05-07T20:32:40.7329705Z x0 = x0.contiguous() 2025-05-07T20:32:40.7329973Z x1 = x1.contiguous() 2025-05-07T20:32:40.7330225Z 2025-05-07T20:32:40.7330429Z if scale_ub is not None: 2025-05-07T20:32:40.7330708Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.7331074Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.7331397Z ) 2025-05-07T20:32:40.7331602Z else: 2025-05-07T20:32:40.7331818Z scale_ub_tensor = None 2025-05-07T20:32:40.7332088Z 2025-05-07T20:32:40.7332334Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.7332659Z op = silu_mul_quant 2025-05-07T20:32:40.7332928Z if compiled: 2025-05-07T20:32:40.7333190Z op = torch.compile(op) 2025-05-07T20:32:40.7333496Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.7333785Z 2025-05-07T20:32:40.7334000Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.7334170Z 2025-05-07T20:32:40.7334273Z moe/activation_test.py:117: 2025-05-07T20:32:40.7334579Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.7334931Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.7335234Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.7335797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.7336359Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.7337026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.7337713Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.7338270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.7339010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.7339690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.7340498Z kernel = self.compile( 2025-05-07T20:32:40.7341114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.7341789Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.7342192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.7342432Z 2025-05-07T20:32:40.7342647Z self = 2025-05-07T20:32:40.7343732Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.7345105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd913c940>} 2025-05-07T20:32:40.7346564Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.7347641Z context = 2025-05-07T20:32:40.7347940Z 2025-05-07T20:32:40.7348110Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.7348655Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.7349122Z module_map=module_map) 2025-05-07T20:32:40.7349501Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.7349870Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.7350137Z E ^ 2025-05-07T20:32:40.7350677Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.7351131Z 2025-05-07T20:32:40.7351556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.7352073Z 2025-05-07T20:32:41.1511448Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.1512144Z self=, 2025-05-07T20:32:41.1512685Z T=4096, 2025-05-07T20:32:41.1512879Z D=5120, 2025-05-07T20:32:41.1513085Z scale_ub=1200.0, 2025-05-07T20:32:41.1513322Z contiguous=True, 2025-05-07T20:32:41.1513552Z compiled=True, 2025-05-07T20:32:41.1513777Z ) 2025-05-07T20:32:41.1514109Z self = 2025-05-07T20:32:41.1514610Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.1514904Z 2025-05-07T20:32:41.1514998Z @given( 2025-05-07T20:32:41.1515236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.1515564Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.1515883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.1516230Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.1516578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.1516871Z ) 2025-05-07T20:32:41.1517231Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.1517683Z def test_silu_mul_quant( 2025-05-07T20:32:41.1517926Z self, 2025-05-07T20:32:41.1518130Z T: int, 2025-05-07T20:32:41.1518336Z D: int, 2025-05-07T20:32:41.1518559Z scale_ub: Optional[float], 2025-05-07T20:32:41.1518842Z contiguous: bool, 2025-05-07T20:32:41.1519092Z compiled: bool, 2025-05-07T20:32:41.1519322Z ) -> None: 2025-05-07T20:32:41.1519849Z torch.manual_seed(2025) 2025-05-07T20:32:41.1520107Z 2025-05-07T20:32:41.1520391Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.1520738Z 2025-05-07T20:32:41.1520944Z x_sign = torch.sign(x) 2025-05-07T20:32:41.1521242Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.1521561Z x = x_sign * x_clamp 2025-05-07T20:32:41.1521812Z x0 = x[:, :D] 2025-05-07T20:32:41.1522039Z x1 = x[:, D:] 2025-05-07T20:32:41.1522249Z 2025-05-07T20:32:41.1522446Z if contiguous: 2025-05-07T20:32:41.1522689Z x0 = x0.contiguous() 2025-05-07T20:32:41.1522956Z x1 = x1.contiguous() 2025-05-07T20:32:41.1523209Z 2025-05-07T20:32:41.1523412Z if scale_ub is not None: 2025-05-07T20:32:41.1523690Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.1524040Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.1524362Z ) 2025-05-07T20:32:41.1524564Z else: 2025-05-07T20:32:41.1524785Z scale_ub_tensor = None 2025-05-07T20:32:41.1525046Z 2025-05-07T20:32:41.1525289Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.1525609Z op = silu_mul_quant 2025-05-07T20:32:41.1525871Z if compiled: 2025-05-07T20:32:41.1526293Z op = torch.compile(op) 2025-05-07T20:32:41.1526596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.1526884Z 2025-05-07T20:32:41.1527088Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.1527254Z 2025-05-07T20:32:41.1527358Z moe/activation_test.py:117: 2025-05-07T20:32:41.1527667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.1528011Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.1528295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.1528867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.1529509Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.1530176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.1530860Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.1531414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.1532103Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.1532776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.1533307Z kernel = self.compile( 2025-05-07T20:32:41.1533942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.1534644Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.1535051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.1535294Z 2025-05-07T20:32:41.1535505Z self = 2025-05-07T20:32:41.1536599Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.1537998Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9107790>} 2025-05-07T20:32:41.1539352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.1540734Z context = 2025-05-07T20:32:41.1541201Z 2025-05-07T20:32:41.1541377Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.1541909Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.1542380Z module_map=module_map) 2025-05-07T20:32:41.1542757Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.1543116Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.1543385Z E ^ 2025-05-07T20:32:41.1543847Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.1544300Z 2025-05-07T20:32:41.1544718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.1545236Z 2025-05-07T20:32:41.1545342Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.1545766Z self=, 2025-05-07T20:32:41.1546170Z T=128, 2025-05-07T20:32:41.1546369Z D=5120, 2025-05-07T20:32:41.1546569Z scale_ub=1200.0, 2025-05-07T20:32:41.1546796Z contiguous=False, 2025-05-07T20:32:41.1547030Z compiled=True, 2025-05-07T20:32:41.1547244Z ) 2025-05-07T20:32:41.2875893Z self = 2025-05-07T20:32:41.2876675Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:41.2877071Z 2025-05-07T20:32:41.2877190Z @given( 2025-05-07T20:32:41.2877442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2877764Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2878088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2878432Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2878769Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2879070Z ) 2025-05-07T20:32:41.2879550Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2880005Z def test_silu_mul_quant( 2025-05-07T20:32:41.2880253Z self, 2025-05-07T20:32:41.2880462Z T: int, 2025-05-07T20:32:41.2880672Z D: int, 2025-05-07T20:32:41.2880909Z scale_ub: Optional[float], 2025-05-07T20:32:41.2881203Z contiguous: bool, 2025-05-07T20:32:41.2881456Z compiled: bool, 2025-05-07T20:32:41.2881690Z ) -> None: 2025-05-07T20:32:41.2881922Z torch.manual_seed(2025) 2025-05-07T20:32:41.2882179Z 2025-05-07T20:32:41.2882457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2882815Z 2025-05-07T20:32:41.2883022Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2883321Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2883650Z x = x_sign * x_clamp 2025-05-07T20:32:41.2883912Z x0 = x[:, :D] 2025-05-07T20:32:41.2884144Z x1 = x[:, D:] 2025-05-07T20:32:41.2884364Z 2025-05-07T20:32:41.2884564Z if contiguous: 2025-05-07T20:32:41.2884806Z x0 = x0.contiguous() 2025-05-07T20:32:41.2885086Z x1 = x1.contiguous() 2025-05-07T20:32:41.2885344Z 2025-05-07T20:32:41.2885553Z if scale_ub is not None: 2025-05-07T20:32:41.2885844Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2886197Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2886520Z ) 2025-05-07T20:32:41.2886726Z else: 2025-05-07T20:32:41.2886953Z scale_ub_tensor = None 2025-05-07T20:32:41.2887221Z 2025-05-07T20:32:41.2887464Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2887799Z op = silu_mul_quant 2025-05-07T20:32:41.2888069Z if compiled: 2025-05-07T20:32:41.2888332Z op = torch.compile(op) 2025-05-07T20:32:41.2888650Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2889031Z 2025-05-07T20:32:41.2889237Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2889414Z 2025-05-07T20:32:41.2889520Z moe/activation_test.py:117: 2025-05-07T20:32:41.2889827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2890177Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2890475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2891049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.2891618Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.2892287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.2892985Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.2893536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.2894233Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.2894907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.2895452Z kernel = self.compile( 2025-05-07T20:32:41.2896050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.2896782Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.2897193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2897438Z 2025-05-07T20:32:41.2897654Z self = 2025-05-07T20:32:41.2898766Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.2900202Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8f0e0d0>} 2025-05-07T20:32:41.2901657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.2902687Z context = 2025-05-07T20:32:41.2902980Z 2025-05-07T20:32:41.2903162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.2903699Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.2904167Z module_map=module_map) 2025-05-07T20:32:41.2904547Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.2904912Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.2905186Z E ^ 2025-05-07T20:32:41.2905666Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.2906123Z 2025-05-07T20:32:41.2906545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.2907062Z 2025-05-07T20:32:41.2907177Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.2907594Z self=, 2025-05-07T20:32:41.2908010Z T=16384, 2025-05-07T20:32:41.2915455Z D=7168, 2025-05-07T20:32:41.2915702Z scale_ub=1200.0, 2025-05-07T20:32:41.2915950Z contiguous=True, 2025-05-07T20:32:41.2916188Z compiled=True, 2025-05-07T20:32:41.2916407Z ) 2025-05-07T20:32:41.2916747Z self = 2025-05-07T20:32:41.2917345Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.2917638Z 2025-05-07T20:32:41.2917735Z @given( 2025-05-07T20:32:41.2917976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2918311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2918633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2918974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2919321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2919626Z ) 2025-05-07T20:32:41.2919989Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2920448Z def test_silu_mul_quant( 2025-05-07T20:32:41.2920706Z self, 2025-05-07T20:32:41.2920908Z T: int, 2025-05-07T20:32:41.2921124Z D: int, 2025-05-07T20:32:41.2921358Z scale_ub: Optional[float], 2025-05-07T20:32:41.2921637Z contiguous: bool, 2025-05-07T20:32:41.2921894Z compiled: bool, 2025-05-07T20:32:41.2922139Z ) -> None: 2025-05-07T20:32:41.2922374Z torch.manual_seed(2025) 2025-05-07T20:32:41.2922627Z 2025-05-07T20:32:41.2922921Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2923284Z 2025-05-07T20:32:41.2923484Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2923879Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2924215Z x = x_sign * x_clamp 2025-05-07T20:32:41.2924463Z x0 = x[:, :D] 2025-05-07T20:32:41.2924695Z x1 = x[:, D:] 2025-05-07T20:32:41.2924917Z 2025-05-07T20:32:41.2925109Z if contiguous: 2025-05-07T20:32:41.2925360Z x0 = x0.contiguous() 2025-05-07T20:32:41.2925644Z x1 = x1.contiguous() 2025-05-07T20:32:41.2925892Z 2025-05-07T20:32:41.2926095Z if scale_ub is not None: 2025-05-07T20:32:41.2926386Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2926735Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2927111Z ) 2025-05-07T20:32:41.2927315Z else: 2025-05-07T20:32:41.2927537Z scale_ub_tensor = None 2025-05-07T20:32:41.2927793Z 2025-05-07T20:32:41.2928038Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2928366Z op = silu_mul_quant 2025-05-07T20:32:41.2928629Z if compiled: 2025-05-07T20:32:41.2928891Z op = torch.compile(op) 2025-05-07T20:32:41.2929203Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2929488Z 2025-05-07T20:32:41.2929695Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2929868Z 2025-05-07T20:32:41.2929984Z moe/activation_test.py:117: 2025-05-07T20:32:41.2930292Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2930641Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2930940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2931514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.2932090Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.2932766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.2933479Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.2934021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.2934730Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.2935410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.2935956Z kernel = self.compile( 2025-05-07T20:32:41.2936510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.2937238Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.2937672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2937909Z 2025-05-07T20:32:41.2938126Z self = 2025-05-07T20:32:41.2939225Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.2941204Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8f0ed30>} 2025-05-07T20:32:41.2942575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.2943615Z context = 2025-05-07T20:32:41.2943912Z 2025-05-07T20:32:41.2944090Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.2944641Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.2945277Z module_map=module_map) 2025-05-07T20:32:41.2945667Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.2946032Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.2946315Z E ^ 2025-05-07T20:32:41.2946795Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.2947249Z 2025-05-07T20:32:41.2947669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.2948191Z 2025-05-07T20:32:41.5712521Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.5713325Z self=, 2025-05-07T20:32:41.5713757Z T=16384, 2025-05-07T20:32:41.5713960Z D=5120, 2025-05-07T20:32:41.5714168Z scale_ub=1200.0, 2025-05-07T20:32:41.5714406Z contiguous=True, 2025-05-07T20:32:41.5714633Z compiled=False, 2025-05-07T20:32:41.5714866Z ) 2025-05-07T20:32:41.5715199Z self = 2025-05-07T20:32:41.5715705Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.5715998Z 2025-05-07T20:32:41.5716083Z @given( 2025-05-07T20:32:41.5716331Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.5716650Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.5716972Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.5717319Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.5717669Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.5717968Z ) 2025-05-07T20:32:41.5718332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.5718787Z def test_silu_mul_quant( 2025-05-07T20:32:41.5719035Z self, 2025-05-07T20:32:41.5719246Z T: int, 2025-05-07T20:32:41.5719466Z D: int, 2025-05-07T20:32:41.5719691Z scale_ub: Optional[float], 2025-05-07T20:32:41.5719989Z contiguous: bool, 2025-05-07T20:32:41.5720272Z compiled: bool, 2025-05-07T20:32:41.5720529Z ) -> None: 2025-05-07T20:32:41.5720765Z torch.manual_seed(2025) 2025-05-07T20:32:41.5721027Z 2025-05-07T20:32:41.5721303Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.5721661Z 2025-05-07T20:32:41.5721868Z x_sign = torch.sign(x) 2025-05-07T20:32:41.5722173Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.5722499Z x = x_sign * x_clamp 2025-05-07T20:32:41.5722855Z x0 = x[:, :D] 2025-05-07T20:32:41.5723079Z x1 = x[:, D:] 2025-05-07T20:32:41.5723297Z 2025-05-07T20:32:41.5723493Z if contiguous: 2025-05-07T20:32:41.5723729Z x0 = x0.contiguous() 2025-05-07T20:32:41.5724004Z x1 = x1.contiguous() 2025-05-07T20:32:41.5724261Z 2025-05-07T20:32:41.5724462Z if scale_ub is not None: 2025-05-07T20:32:41.5724751Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.5725105Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.5725421Z ) 2025-05-07T20:32:41.5725629Z else: 2025-05-07T20:32:41.5725852Z scale_ub_tensor = None 2025-05-07T20:32:41.5726115Z 2025-05-07T20:32:41.5726362Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.5726686Z op = silu_mul_quant 2025-05-07T20:32:41.5726945Z if compiled: 2025-05-07T20:32:41.5727205Z op = torch.compile(op) 2025-05-07T20:32:41.5727519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5727805Z 2025-05-07T20:32:41.5728000Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.5728174Z 2025-05-07T20:32:41.5728279Z moe/activation_test.py:117: 2025-05-07T20:32:41.5728583Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5729067Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.5729360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5730063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.5730758Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.5731305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.5731995Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.5732668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.5733252Z kernel = self.compile( 2025-05-07T20:32:41.5733800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.5734467Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.5734887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5735122Z 2025-05-07T20:32:41.5735336Z self = 2025-05-07T20:32:41.5736430Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.5737845Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9052700>} 2025-05-07T20:32:41.5739190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.5740476Z context = 2025-05-07T20:32:41.5740781Z 2025-05-07T20:32:41.5740952Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.5741548Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.5742025Z module_map=module_map) 2025-05-07T20:32:41.5742397Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.5742767Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.5743037Z E ^ 2025-05-07T20:32:41.5743581Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.5744054Z 2025-05-07T20:32:41.5744473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.5744994Z 2025-05-07T20:32:41.5745101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.5745557Z self=, 2025-05-07T20:32:41.5745975Z T=1, 2025-05-07T20:32:41.5746165Z D=7168, 2025-05-07T20:32:41.5746370Z scale_ub=1200.0, 2025-05-07T20:32:41.5746604Z contiguous=False, 2025-05-07T20:32:41.5746834Z compiled=False, 2025-05-07T20:32:41.5747050Z ) 2025-05-07T20:32:41.5747378Z self = 2025-05-07T20:32:41.5747871Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:41.5748154Z 2025-05-07T20:32:41.5748236Z @given( 2025-05-07T20:32:41.5748487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.5748813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.5749126Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.5749470Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.5749808Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.5750227Z ) 2025-05-07T20:32:41.5750594Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.5751042Z def test_silu_mul_quant( 2025-05-07T20:32:41.5751286Z self, 2025-05-07T20:32:41.5751490Z T: int, 2025-05-07T20:32:41.5751694Z D: int, 2025-05-07T20:32:41.5751916Z scale_ub: Optional[float], 2025-05-07T20:32:41.5752200Z contiguous: bool, 2025-05-07T20:32:41.5752447Z compiled: bool, 2025-05-07T20:32:41.5752684Z ) -> None: 2025-05-07T20:32:41.5752904Z torch.manual_seed(2025) 2025-05-07T20:32:41.5753157Z 2025-05-07T20:32:41.5753528Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.5753883Z 2025-05-07T20:32:41.5754090Z x_sign = torch.sign(x) 2025-05-07T20:32:41.5754385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.5754710Z x = x_sign * x_clamp 2025-05-07T20:32:41.5754974Z x0 = x[:, :D] 2025-05-07T20:32:41.5755195Z x1 = x[:, D:] 2025-05-07T20:32:41.5755421Z 2025-05-07T20:32:41.5755623Z if contiguous: 2025-05-07T20:32:41.5755857Z x0 = x0.contiguous() 2025-05-07T20:32:41.5756126Z x1 = x1.contiguous() 2025-05-07T20:32:41.5756377Z 2025-05-07T20:32:41.5756578Z if scale_ub is not None: 2025-05-07T20:32:41.5756855Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.5757201Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.5757527Z ) 2025-05-07T20:32:41.5757726Z else: 2025-05-07T20:32:41.5757953Z scale_ub_tensor = None 2025-05-07T20:32:41.5758220Z 2025-05-07T20:32:41.5758453Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.5758782Z op = silu_mul_quant 2025-05-07T20:32:41.5759044Z if compiled: 2025-05-07T20:32:41.5759296Z op = torch.compile(op) 2025-05-07T20:32:41.5759609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5759898Z 2025-05-07T20:32:41.5760091Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.5760268Z 2025-05-07T20:32:41.5760371Z moe/activation_test.py:117: 2025-05-07T20:32:41.5760671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5761015Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.5761303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5762006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.5762758Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.5763306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.5763998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.5764672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.5765214Z kernel = self.compile( 2025-05-07T20:32:41.5765755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.5766421Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.5766827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5767061Z 2025-05-07T20:32:41.5767282Z self = 2025-05-07T20:32:41.5768371Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.5769799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd900c0d0>} 2025-05-07T20:32:41.5771197Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.5772239Z context = 2025-05-07T20:32:41.5772532Z 2025-05-07T20:32:41.5772705Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.5773250Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.5773775Z module_map=module_map) 2025-05-07T20:32:41.5774161Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.5774523Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.5774795Z E ^ 2025-05-07T20:32:41.5775268Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.5775722Z 2025-05-07T20:32:41.5776141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.5776662Z 2025-05-07T20:32:41.5776770Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.5777198Z self=, 2025-05-07T20:32:41.5777615Z T=4096, 2025-05-07T20:32:41.5777809Z D=7168, 2025-05-07T20:32:41.5778013Z scale_ub=1200.0, 2025-05-07T20:32:41.5778250Z contiguous=False, 2025-05-07T20:32:41.5778482Z compiled=True, 2025-05-07T20:32:41.5778703Z ) 2025-05-07T20:32:41.6962047Z self = 2025-05-07T20:32:41.6963153Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:41.6963704Z 2025-05-07T20:32:41.6963875Z @given( 2025-05-07T20:32:41.6964363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.6965010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.6965629Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.6966297Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.6966953Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.6967532Z ) 2025-05-07T20:32:41.6968230Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.6969108Z def test_silu_mul_quant( 2025-05-07T20:32:41.6969594Z self, 2025-05-07T20:32:41.6969991Z T: int, 2025-05-07T20:32:41.6970258Z D: int, 2025-05-07T20:32:41.6970788Z scale_ub: Optional[float], 2025-05-07T20:32:41.6971077Z contiguous: bool, 2025-05-07T20:32:41.6971318Z compiled: bool, 2025-05-07T20:32:41.6971553Z ) -> None: 2025-05-07T20:32:41.6971779Z torch.manual_seed(2025) 2025-05-07T20:32:41.6972025Z 2025-05-07T20:32:41.6972311Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.6972661Z 2025-05-07T20:32:41.6972860Z x_sign = torch.sign(x) 2025-05-07T20:32:41.6973151Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.6973470Z x = x_sign * x_clamp 2025-05-07T20:32:41.6973718Z x0 = x[:, :D] 2025-05-07T20:32:41.6973935Z x1 = x[:, D:] 2025-05-07T20:32:41.6974148Z 2025-05-07T20:32:41.6974342Z if contiguous: 2025-05-07T20:32:41.6974575Z x0 = x0.contiguous() 2025-05-07T20:32:41.6974849Z x1 = x1.contiguous() 2025-05-07T20:32:41.6975098Z 2025-05-07T20:32:41.6975299Z if scale_ub is not None: 2025-05-07T20:32:41.6975583Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.6975930Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.6976242Z ) 2025-05-07T20:32:41.6976449Z else: 2025-05-07T20:32:41.6976814Z scale_ub_tensor = None 2025-05-07T20:32:41.6977070Z 2025-05-07T20:32:41.6977311Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.6977637Z op = silu_mul_quant 2025-05-07T20:32:41.6977905Z if compiled: 2025-05-07T20:32:41.6978154Z op = torch.compile(op) 2025-05-07T20:32:41.6978459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.6978744Z 2025-05-07T20:32:41.6978939Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.6979112Z 2025-05-07T20:32:41.6979215Z moe/activation_test.py:117: 2025-05-07T20:32:41.6979520Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.6979945Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.6980251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.6980827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.6981491Z return fn(*args, **kwargs) 
Hypothesis continued with further examples. Every example that reached the Triton kernel launch failed with the identical CompilationError; as GPU memory filled up, larger examples began failing earlier, during input setup:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> same CompilationError (fp8e4nv not supported) from the _fbgemm_silu_mul_quant launch.

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> same CompilationError.

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError.

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
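The allocation sizes in these OutOfMemoryErrors match the test's input tensor exactly: x has shape [T, 2 * D] in bfloat16 (2 bytes per element), and each of torch.abs, torch.clamp, and torch.sign materializes another tensor of the same size. A quick check of the figures reported in this log:

def x_size_mib(T: int, D: int) -> float:
    # torch.randn([T, 2 * D], dtype=torch.bfloat16): 2 bytes per element
    return T * 2 * D * 2 / 2**20

print(x_size_mib(16384, 5120))  # 320.0 -> the failed x_clamp allocation above
print(x_size_mib(16384, 7168))  # 448.0 -> the largest example in this run
print(x_size_mib(2048, 5120))   # 40.0

Even the largest example needs only a few GiB with all temporaries live, far below the card's 22.07 GiB. The process is already holding ~21.9 GiB when these examples run, and the free figure keeps shrinking below (140.44 MiB here, then 28.44, then 26.44 MiB), so the OOMs come from memory accumulating across Hypothesis examples rather than from any single example being too large for the GPU.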
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> OutOfMemoryError at x_clamp (moe/activation_test.py:95): tried to allocate 112.00 MiB with 28.44 MiB free.

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> OutOfMemoryError at x = torch.randn(...) (moe/activation_test.py:92): tried to allocate 448.00 MiB with 140.44 MiB free.

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> OutOfMemoryError at x_clamp (moe/activation_test.py:95): tried to allocate 56.00 MiB with 28.44 MiB free.

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> OutOfMemoryError at x_sign = torch.sign(x) (moe/activation_test.py:94): tried to allocate 56.00 MiB with 28.44 MiB free.

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> same CompilationError (fp8e4nv not supported) from the _fbgemm_silu_mul_quant launch.

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> same CompilationError.

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> same CompilationError.
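Each OOM message carries the allocator's fragmentation hint (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True), but given the shrinking free-memory trend, releasing memory between examples is the more direct mitigation. The sketch below is a hypothetical fix, not the test suite's actual code; note that the environment variable only takes effect if set before the first CUDA allocation, so in CI it belongs in the job environment rather than in the test body:

import gc

import torch

def release_cuda_memory() -> None:
    # Hypothetical helper to call at the top of test_silu_mul_quant, so that it
    # runs once per Hypothesis example rather than once per test method:
    gc.collect()              # drop tensors kept alive only by stored tracebacks
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver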
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> OutOfMemoryError at x = torch.randn(...) (moe/activation_test.py:92): tried to allocate 56.00 MiB with 26.44 MiB free.

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> same CompilationError (fp8e4nv not supported).

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> OutOfMemoryError at x_sign = torch.sign(x) (moe/activation_test.py:94): tried to allocate 40.00 MiB with 26.44 MiB free.

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> OutOfMemoryError at x = torch.randn(...) (moe/activation_test.py:92): tried to allocate 320.00 MiB with 26.44 MiB free.

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> OutOfMemoryError at x = torch.randn(...) (moe/activation_test.py:92): tried to allocate 80.00 MiB with 26.44 MiB free.

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> OutOfMemoryError at x = torch.randn(...) (moe/activation_test.py:92): tried to allocate 40.00 MiB with 26.44 MiB free.

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> OutOfMemoryError at x = torch.randn(...) (moe/activation_test.py:92): tried to allocate 112.00 MiB with 26.44 MiB free.

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5957418Z 2025-05-07T20:32:42.5957540Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.5957763Z 2025-05-07T20:32:42.5957870Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5958288Z self=, 2025-05-07T20:32:42.5958699Z T=4096, 2025-05-07T20:32:42.5958889Z D=7168, 2025-05-07T20:32:42.5959091Z scale_ub=1200.0, 2025-05-07T20:32:42.5959327Z contiguous=True, 2025-05-07T20:32:42.5959560Z compiled=False, 2025-05-07T20:32:42.5959780Z ) 2025-05-07T20:32:42.5960106Z self = 2025-05-07T20:32:42.5960629Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.5960939Z 2025-05-07T20:32:42.5961027Z @given( 2025-05-07T20:32:42.5961267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5961594Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5961913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5962254Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5962597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5962891Z ) 2025-05-07T20:32:42.5963255Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5963711Z def test_silu_mul_quant( 2025-05-07T20:32:42.5963961Z self, 2025-05-07T20:32:42.5964222Z T: int, 2025-05-07T20:32:42.5964435Z D: int, 2025-05-07T20:32:42.5964659Z scale_ub: Optional[float], 2025-05-07T20:32:42.5964945Z contiguous: bool, 2025-05-07T20:32:42.5965193Z compiled: bool, 2025-05-07T20:32:42.5965424Z ) -> None: 2025-05-07T20:32:42.5965655Z torch.manual_seed(2025) 2025-05-07T20:32:42.5965903Z 2025-05-07T20:32:42.5966186Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5968209Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5970048Z 2025-05-07T20:32:42.5970169Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.5970386Z 2025-05-07T20:32:42.5970502Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5970959Z self=, 2025-05-07T20:32:42.5971409Z T=16384, 2025-05-07T20:32:42.5971612Z D=7168, 2025-05-07T20:32:42.5971807Z scale_ub=None, 2025-05-07T20:32:42.5972032Z contiguous=False, 2025-05-07T20:32:42.5972267Z compiled=True, 2025-05-07T20:32:42.5972478Z ) 2025-05-07T20:32:42.7266765Z self = 2025-05-07T20:32:42.7267310Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.7267604Z 2025-05-07T20:32:42.7267689Z @given( 2025-05-07T20:32:42.7267936Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7268576Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7268906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7269256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7269593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7269898Z ) 2025-05-07T20:32:42.7270277Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7270724Z def test_silu_mul_quant( 2025-05-07T20:32:42.7270980Z self, 2025-05-07T20:32:42.7271187Z T: int, 2025-05-07T20:32:42.7271389Z D: int, 2025-05-07T20:32:42.7271624Z scale_ub: Optional[float], 2025-05-07T20:32:42.7271910Z contiguous: bool, 2025-05-07T20:32:42.7272164Z compiled: bool, 2025-05-07T20:32:42.7272395Z ) -> None: 2025-05-07T20:32:42.7272624Z torch.manual_seed(2025) 2025-05-07T20:32:42.7272879Z 2025-05-07T20:32:42.7273158Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7275224Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7277081Z 2025-05-07T20:32:42.7277211Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7277429Z 2025-05-07T20:32:42.7277546Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7277973Z self=, 2025-05-07T20:32:42.7278384Z T=4096, 2025-05-07T20:32:42.7278583Z D=7168, 2025-05-07T20:32:42.7278871Z scale_ub=None, 2025-05-07T20:32:42.7279095Z contiguous=True, 2025-05-07T20:32:42.7279335Z compiled=False, 2025-05-07T20:32:42.7279555Z ) 2025-05-07T20:32:42.7279877Z self = 2025-05-07T20:32:42.7280388Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7280669Z 2025-05-07T20:32:42.7280760Z @given( 2025-05-07T20:32:42.7280996Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7281321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7281640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7281986Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7282320Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7282639Z ) 2025-05-07T20:32:42.7283006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7283466Z def test_silu_mul_quant( 2025-05-07T20:32:42.7283715Z self, 2025-05-07T20:32:42.7283926Z T: int, 2025-05-07T20:32:42.7284140Z D: int, 2025-05-07T20:32:42.7284365Z scale_ub: Optional[float], 2025-05-07T20:32:42.7284652Z contiguous: bool, 2025-05-07T20:32:42.7284906Z compiled: bool, 2025-05-07T20:32:42.7285359Z ) -> None: 2025-05-07T20:32:42.7285593Z torch.manual_seed(2025) 2025-05-07T20:32:42.7285851Z 2025-05-07T20:32:42.7286126Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7288162Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7290079Z 2025-05-07T20:32:42.7290205Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7290432Z 2025-05-07T20:32:42.7290539Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7290966Z self=, 2025-05-07T20:32:42.7291371Z T=16384, 2025-05-07T20:32:42.7291578Z D=7168, 2025-05-07T20:32:42.7291782Z scale_ub=None, 2025-05-07T20:32:42.7292001Z contiguous=True, 2025-05-07T20:32:42.7292238Z compiled=False, 2025-05-07T20:32:42.7292468Z ) 2025-05-07T20:32:42.7292788Z self = 2025-05-07T20:32:42.7293298Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7293574Z 2025-05-07T20:32:42.7293663Z @given( 2025-05-07T20:32:42.7293904Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7294230Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7294549Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7294883Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7295231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7295534Z ) 2025-05-07T20:32:42.7295897Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7296343Z def test_silu_mul_quant( 2025-05-07T20:32:42.7296605Z self, 2025-05-07T20:32:42.7296821Z T: int, 2025-05-07T20:32:42.7297026Z D: int, 2025-05-07T20:32:42.7297256Z scale_ub: Optional[float], 2025-05-07T20:32:42.7297541Z contiguous: bool, 2025-05-07T20:32:42.7297789Z compiled: bool, 2025-05-07T20:32:42.7298024Z ) -> None: 2025-05-07T20:32:42.7298250Z torch.manual_seed(2025) 2025-05-07T20:32:42.7298504Z 2025-05-07T20:32:42.7298839Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7300877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7302857Z 2025-05-07T20:32:42.7302980Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7303196Z 2025-05-07T20:32:42.7303308Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7303721Z self=, 2025-05-07T20:32:42.7304144Z T=16384, 2025-05-07T20:32:42.7304349Z D=7168, 2025-05-07T20:32:42.7304546Z scale_ub=1200.0, 2025-05-07T20:32:42.7304779Z contiguous=True, 2025-05-07T20:32:42.7305014Z compiled=False, 2025-05-07T20:32:42.7305224Z ) 2025-05-07T20:32:42.7305549Z self = 2025-05-07T20:32:42.7306145Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.7306430Z 2025-05-07T20:32:42.7306520Z @given( 2025-05-07T20:32:42.7306755Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7307086Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7307406Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7307741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7308080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7308377Z ) 2025-05-07T20:32:42.7308733Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7309231Z def test_silu_mul_quant( 2025-05-07T20:32:42.7309487Z self, 2025-05-07T20:32:42.7309689Z T: int, 2025-05-07T20:32:42.7309900Z D: int, 2025-05-07T20:32:42.7310130Z scale_ub: Optional[float], 2025-05-07T20:32:42.7310413Z contiguous: bool, 2025-05-07T20:32:42.7310667Z compiled: bool, 2025-05-07T20:32:42.7310901Z ) -> None: 2025-05-07T20:32:42.7311130Z torch.manual_seed(2025) 2025-05-07T20:32:42.7311381Z 2025-05-07T20:32:42.7311667Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7313700Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
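[Editor's note] The "allocated by PyTorch" figure climbs across these trials (21.73 GiB here, then 21.74 GiB and 21.77 GiB further down), so tensors from earlier Hypothesis examples appear to outlive their trial and starve the later ones. A hypothetical mitigation, not part of activation_test.py as shown, is to release cached blocks between examples via a tearDown hook:

import gc
import unittest
import torch

class ActivationTests(unittest.TestCase):  # sketch; mirrors the class name in this log
    def tearDown(self) -> None:  # hypothetical addition
        gc.collect()              # drop Python references left by the previous example
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver
        super().tearDown()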
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7315540Z 2025-05-07T20:32:42.7315668Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7315890Z 2025-05-07T20:32:42.7316001Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7316424Z self=, 2025-05-07T20:32:42.7316837Z T=128, 2025-05-07T20:32:42.7317038Z D=5120, 2025-05-07T20:32:42.7317237Z scale_ub=1200.0, 2025-05-07T20:32:42.7317477Z contiguous=False, 2025-05-07T20:32:42.7317715Z compiled=False, 2025-05-07T20:32:42.7317929Z ) 2025-05-07T20:32:42.8947851Z self = 2025-05-07T20:32:42.8948410Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.8948977Z 2025-05-07T20:32:42.8949064Z @given( 2025-05-07T20:32:42.8949307Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.8949632Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.8949947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.8950306Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.8950676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.8950993Z ) 2025-05-07T20:32:42.8951359Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.8951814Z def test_silu_mul_quant( 2025-05-07T20:32:42.8952064Z self, 2025-05-07T20:32:42.8952273Z T: int, 2025-05-07T20:32:42.8952484Z D: int, 2025-05-07T20:32:42.8952713Z scale_ub: Optional[float], 2025-05-07T20:32:42.8952991Z contiguous: bool, 2025-05-07T20:32:42.8953244Z compiled: bool, 2025-05-07T20:32:42.8953480Z ) -> None: 2025-05-07T20:32:42.8953717Z torch.manual_seed(2025) 2025-05-07T20:32:42.8953981Z 2025-05-07T20:32:42.8954262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.8954609Z 2025-05-07T20:32:42.8954815Z x_sign = torch.sign(x) 2025-05-07T20:32:42.8955208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.8955594Z x = x_sign * x_clamp 2025-05-07T20:32:42.8955849Z x0 = x[:, :D] 2025-05-07T20:32:42.8956079Z x1 = x[:, D:] 2025-05-07T20:32:42.8956290Z 2025-05-07T20:32:42.8956489Z if contiguous: 2025-05-07T20:32:42.8956738Z x0 = x0.contiguous() 2025-05-07T20:32:42.8957003Z x1 = x1.contiguous() 2025-05-07T20:32:42.8957254Z 2025-05-07T20:32:42.8957457Z if scale_ub is not None: 2025-05-07T20:32:42.8957738Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.8958086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.8958494Z ) 2025-05-07T20:32:42.8958700Z else: 2025-05-07T20:32:42.8958914Z scale_ub_tensor = None 2025-05-07T20:32:42.8959179Z 2025-05-07T20:32:42.8959419Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.8959740Z op = silu_mul_quant 2025-05-07T20:32:42.8960006Z if compiled: 2025-05-07T20:32:42.8960267Z op = torch.compile(op) 2025-05-07T20:32:42.8960572Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.8960864Z 2025-05-07T20:32:42.8961066Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.8961235Z 2025-05-07T20:32:42.8961340Z moe/activation_test.py:117: 2025-05-07T20:32:42.8961643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.8961989Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.8962281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.8962985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.8963689Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.8964238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.8964927Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.8965602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.8966149Z kernel = self.compile( 2025-05-07T20:32:42.8966704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.8967366Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.8967779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.8968017Z 2025-05-07T20:32:42.8968286Z self = 2025-05-07T20:32:42.8969390Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.8970824Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8a46ca0>} 2025-05-07T20:32:42.8972181Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.8973211Z context = 2025-05-07T20:32:42.8973506Z 2025-05-07T20:32:42.8973687Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.8974219Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.8974700Z module_map=module_map) 2025-05-07T20:32:42.8975081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.8975453Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.8975771Z E ^ 2025-05-07T20:32:42.8976288Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.8976741Z 2025-05-07T20:32:42.8977165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.8977678Z 2025-05-07T20:32:42.8977786Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.8978211Z self=, 2025-05-07T20:32:42.8978625Z T=2048, 2025-05-07T20:32:42.8978825Z D=7168, 2025-05-07T20:32:42.8979023Z scale_ub=None, 2025-05-07T20:32:42.8979297Z contiguous=False, 2025-05-07T20:32:42.8979539Z compiled=False, 2025-05-07T20:32:42.8979751Z ) 2025-05-07T20:32:42.8980079Z self = 2025-05-07T20:32:42.8980588Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.8980895Z 2025-05-07T20:32:42.8980990Z @given( 2025-05-07T20:32:42.8981364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.8981695Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.8982010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.8982349Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.8982688Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.8982984Z ) 2025-05-07T20:32:42.8983338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.8983799Z def test_silu_mul_quant( 2025-05-07T20:32:42.8984059Z self, 2025-05-07T20:32:42.8984258Z T: int, 2025-05-07T20:32:42.8984470Z D: int, 2025-05-07T20:32:42.8984699Z scale_ub: Optional[float], 2025-05-07T20:32:42.8984977Z contiguous: bool, 2025-05-07T20:32:42.8985227Z compiled: bool, 2025-05-07T20:32:42.8985460Z ) -> None: 2025-05-07T20:32:42.8985685Z torch.manual_seed(2025) 2025-05-07T20:32:42.8985940Z 2025-05-07T20:32:42.8986225Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.8988320Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
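[Editor's note] The CompilationError just above is an architecture gap rather than a flake: Triton's fp8e4nv type (float8_e4m3fn) requires compute capability 8.9 or newer, and the A10G on this linux.g5.4xlarge.nvidia.gpu runner reports (8, 6), which is why only 'fp8e4b15' and 'fp8e5' are offered. A minimal sketch of a capability guard, assuming unittest-style tests; the helper name and example class are hypothetical, not FBGEMM's actual gating:

import unittest
import torch

def _supports_fp8e4nv() -> bool:  # hypothetical helper, not part of activation_test.py
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class Fp8GuardExample(unittest.TestCase):  # illustrative stand-in for ActivationTests
    @unittest.skipUnless(_supports_fp8e4nv(), "Triton fp8e4nv needs sm_89+ (Ada/Hopper)")
    def test_fp8_path(self) -> None:
        pass  # the fp8 kernel under test would run here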
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.8990200Z 2025-05-07T20:32:42.8990331Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.8990552Z 2025-05-07T20:32:42.8990658Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.8991087Z self=, 2025-05-07T20:32:42.8991501Z T=128, 2025-05-07T20:32:42.8991695Z D=7168, 2025-05-07T20:32:42.8991899Z scale_ub=1200.0, 2025-05-07T20:32:42.8992135Z contiguous=True, 2025-05-07T20:32:42.8992360Z compiled=True, 2025-05-07T20:32:42.8992575Z ) 2025-05-07T20:32:42.9437860Z self = 2025-05-07T20:32:42.9447550Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.9447898Z 2025-05-07T20:32:42.9447997Z @given( 2025-05-07T20:32:42.9448238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.9448593Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.9448921Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.9449255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.9449594Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.9450149Z ) 2025-05-07T20:32:42.9450580Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.9451042Z def test_silu_mul_quant( 2025-05-07T20:32:42.9451301Z self, 2025-05-07T20:32:42.9451501Z T: int, 2025-05-07T20:32:42.9451715Z D: int, 2025-05-07T20:32:42.9451952Z scale_ub: Optional[float], 2025-05-07T20:32:42.9452229Z contiguous: bool, 2025-05-07T20:32:42.9452484Z compiled: bool, 2025-05-07T20:32:42.9452727Z ) -> None: 2025-05-07T20:32:42.9452951Z torch.manual_seed(2025) 2025-05-07T20:32:42.9453208Z 2025-05-07T20:32:42.9453503Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.9453948Z 2025-05-07T20:32:42.9454152Z x_sign = torch.sign(x) 2025-05-07T20:32:42.9454459Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.9454785Z x = x_sign * x_clamp 2025-05-07T20:32:42.9455031Z x0 = x[:, :D] 2025-05-07T20:32:42.9455268Z x1 = x[:, D:] 2025-05-07T20:32:42.9455495Z 2025-05-07T20:32:42.9455687Z if contiguous: 2025-05-07T20:32:42.9455938Z x0 = x0.contiguous() 2025-05-07T20:32:42.9456214Z x1 = x1.contiguous() 2025-05-07T20:32:42.9456461Z 2025-05-07T20:32:42.9456668Z if scale_ub is not None: 2025-05-07T20:32:42.9456956Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.9457300Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.9457624Z ) 2025-05-07T20:32:42.9457831Z else: 2025-05-07T20:32:42.9458048Z scale_ub_tensor = None 2025-05-07T20:32:42.9458323Z 2025-05-07T20:32:42.9458568Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.9458897Z op = silu_mul_quant 2025-05-07T20:32:42.9459159Z if compiled: 2025-05-07T20:32:42.9459422Z op = torch.compile(op) 2025-05-07T20:32:42.9459742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.9460025Z 2025-05-07T20:32:42.9460232Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.9460403Z 2025-05-07T20:32:42.9460520Z moe/activation_test.py:117: 2025-05-07T20:32:42.9460857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9461300Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.9461634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.9462211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.9462784Z return fn(*args, **kwargs) 2025-05-07T20:32:42.9463528Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.9464241Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.9464794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.9465486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.9466162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.9466702Z kernel = self.compile( 2025-05-07T20:32:42.9467253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.9467907Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.9468314Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9468552Z 2025-05-07T20:32:42.9468777Z self = 2025-05-07T20:32:42.9469909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.9471398Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd89350d0>} 2025-05-07T20:32:42.9472743Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.9473768Z context = 2025-05-07T20:32:42.9474061Z 2025-05-07T20:32:42.9474246Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.9474821Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.9475293Z module_map=module_map) 2025-05-07T20:32:42.9475679Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.9476059Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.9476330Z E ^ 2025-05-07T20:32:42.9476804Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.9477261Z 2025-05-07T20:32:42.9477695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.9478218Z 2025-05-07T20:32:42.9478326Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.9478748Z self=, 2025-05-07T20:32:42.9479163Z T=128, 2025-05-07T20:32:42.9479369Z D=7168, 2025-05-07T20:32:42.9479568Z scale_ub=1200.0, 2025-05-07T20:32:42.9479805Z contiguous=True, 2025-05-07T20:32:42.9480043Z compiled=False, 2025-05-07T20:32:42.9480256Z ) 2025-05-07T20:32:42.9480587Z self = 2025-05-07T20:32:42.9481098Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.9481380Z 2025-05-07T20:32:42.9481463Z @given( 2025-05-07T20:32:42.9481707Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.9482034Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.9482348Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.9482688Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.9483030Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.9483332Z ) 2025-05-07T20:32:42.9483692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.9484206Z def test_silu_mul_quant( 2025-05-07T20:32:42.9484463Z self, 2025-05-07T20:32:42.9484663Z T: int, 2025-05-07T20:32:42.9484877Z D: int, 2025-05-07T20:32:42.9485107Z scale_ub: Optional[float], 2025-05-07T20:32:42.9485384Z contiguous: bool, 2025-05-07T20:32:42.9485642Z compiled: bool, 2025-05-07T20:32:42.9485885Z ) -> None: 2025-05-07T20:32:42.9486106Z torch.manual_seed(2025) 2025-05-07T20:32:42.9486359Z 2025-05-07T20:32:42.9486645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.9486995Z 2025-05-07T20:32:42.9487200Z x_sign = torch.sign(x) 2025-05-07T20:32:42.9487501Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.9489496Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.9491495Z 2025-05-07T20:32:42.9491629Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.9491854Z 2025-05-07T20:32:42.9491960Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.9492398Z self=, 2025-05-07T20:32:42.9492808Z T=128, 2025-05-07T20:32:42.9492997Z D=5120, 2025-05-07T20:32:42.9493198Z scale_ub=1200.0, 2025-05-07T20:32:42.9493431Z contiguous=True, 2025-05-07T20:32:42.9493654Z compiled=True, 2025-05-07T20:32:42.9493867Z ) 2025-05-07T20:32:42.9494197Z self = 2025-05-07T20:32:42.9494752Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.9495026Z 2025-05-07T20:32:42.9495108Z @given( 2025-05-07T20:32:42.9495347Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.9495670Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.9495986Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.9496329Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.9496672Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.9496961Z ) 2025-05-07T20:32:42.9497326Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.9497775Z def test_silu_mul_quant( 2025-05-07T20:32:42.9498027Z self, 2025-05-07T20:32:42.9498222Z T: int, 2025-05-07T20:32:42.9498425Z D: int, 2025-05-07T20:32:42.9498650Z scale_ub: Optional[float], 2025-05-07T20:32:42.9498924Z contiguous: bool, 2025-05-07T20:32:42.9499179Z compiled: bool, 2025-05-07T20:32:42.9499411Z ) -> None: 2025-05-07T20:32:42.9499628Z torch.manual_seed(2025) 2025-05-07T20:32:42.9499880Z 2025-05-07T20:32:42.9500162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.9500520Z 2025-05-07T20:32:42.9500761Z x_sign = torch.sign(x) 2025-05-07T20:32:42.9501215Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.9503270Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.9505130Z 2025-05-07T20:32:42.9505257Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.9505475Z 2025-05-07T20:32:42.9505581Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.9506000Z self=, 2025-05-07T20:32:42.9506420Z T=128, 2025-05-07T20:32:42.9506609Z D=7168, 2025-05-07T20:32:42.9506810Z scale_ub=None, 2025-05-07T20:32:42.9507030Z contiguous=True, 2025-05-07T20:32:42.9507255Z compiled=True, 2025-05-07T20:32:42.9507469Z ) 2025-05-07T20:32:43.1686540Z self = 2025-05-07T20:32:43.1687093Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1687366Z 2025-05-07T20:32:43.1687449Z @given( 2025-05-07T20:32:43.1687689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1688034Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1688359Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1688705Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1689047Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1689344Z ) 2025-05-07T20:32:43.1689982Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1690749Z def test_silu_mul_quant( 2025-05-07T20:32:43.1691121Z self, 2025-05-07T20:32:43.1691390Z T: int, 2025-05-07T20:32:43.1691676Z D: int, 2025-05-07T20:32:43.1691987Z scale_ub: Optional[float], 2025-05-07T20:32:43.1692357Z contiguous: bool, 2025-05-07T20:32:43.1692680Z compiled: bool, 2025-05-07T20:32:43.1692981Z ) -> None: 2025-05-07T20:32:43.1693259Z torch.manual_seed(2025) 2025-05-07T20:32:43.1693582Z 2025-05-07T20:32:43.1693867Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1696077Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1697958Z 2025-05-07T20:32:43.1698087Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.1698303Z 2025-05-07T20:32:43.1707891Z FAILED 2025-05-07T20:32:43.1708162Z 2025-05-07T20:32:43.1708554Z =================================== FAILURES =================================== 2025-05-07T20:32:43.1709284Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:43.1709981Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:43.1710872Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:43.1711644Z | yield 2025-05-07T20:32:43.1712325Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:32:43.1713086Z | self._callTestMethod(testMethod) 2025-05-07T20:32:43.1713717Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:32:43.1714280Z | method() 2025-05-07T20:32:43.1714969Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:43.1715846Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1716973Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:43.1717883Z | raise the_error_hypothesis_found 2025-05-07T20:32:43.1718633Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:43.1719332Z +-+---------------- 1 ---------------- 2025-05-07T20:32:43.1719789Z | Traceback (most recent call last): 2025-05-07T20:32:43.1720857Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:43.1721995Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1724498Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1727304Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:43.1728094Z | self=, 2025-05-07T20:32:43.1728674Z | T=2048, 2025-05-07T20:32:43.1729005Z | D=5120, # or any other generated value 2025-05-07T20:32:43.1729482Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:43.1729994Z | contiguous=True, # or any other generated value 2025-05-07T20:32:43.1730520Z | compiled=False, # or any other generated value 2025-05-07T20:32:43.1730945Z | ) 2025-05-07T20:32:43.1731195Z | 2025-05-07T20:32:43.1731951Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:43.1732912Z +---------------- 2 ---------------- 2025-05-07T20:32:43.1733356Z | Traceback (most recent call last): 2025-05-07T20:32:43.1734358Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:43.1735448Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1738313Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1741602Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:43.1742273Z | self=, 2025-05-07T20:32:43.1742859Z | T=128, 2025-05-07T20:32:43.1743154Z | D=7168, 2025-05-07T20:32:43.1743452Z | scale_ub=None, 2025-05-07T20:32:43.1743800Z | contiguous=True, 2025-05-07T20:32:43.1744187Z | compiled=True, 2025-05-07T20:32:43.1744500Z | ) 2025-05-07T20:32:43.1744786Z | 2025-05-07T20:32:43.1745527Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:43.1746243Z +---------------- 3 ---------------- 2025-05-07T20:32:43.1746543Z | Traceback (most recent call last): 2025-05-07T20:32:43.1747371Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:43.1748165Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1750190Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1752240Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:43.1752688Z | self=, 2025-05-07T20:32:43.1753102Z | T=128, 2025-05-07T20:32:43.1753324Z | D=5120, 2025-05-07T20:32:43.1753540Z | scale_ub=1200.0, 2025-05-07T20:32:43.1753794Z | contiguous=True, 2025-05-07T20:32:43.1754101Z | compiled=True, 2025-05-07T20:32:43.1754403Z | ) 2025-05-07T20:32:43.1754662Z | 2025-05-07T20:32:43.1755534Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:43.1756497Z +---------------- 4 ---------------- 2025-05-07T20:32:43.1756924Z | Traceback (most recent call last): 2025-05-07T20:32:43.1757961Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:43.1759021Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.1759954Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:43.1761110Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1762299Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:43.1763434Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1764334Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:43.1765394Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1766449Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:43.1767524Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1768672Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:43.1769795Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1770893Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:43.1771877Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1772797Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:43.1773596Z | fn() 2025-05-07T20:32:43.1774354Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:43.1775219Z | self.fn.run( 2025-05-07T20:32:43.1776054Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:43.1776869Z | kernel = self.compile( 2025-05-07T20:32:43.1777726Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:43.1778745Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1779746Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:43.1780854Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1781716Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1782209Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1782585Z | ^ 2025-05-07T20:32:43.1783223Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1784036Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:43.1784611Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:43.1785345Z | self=, 2025-05-07T20:32:43.1785967Z | T=1, # or any other generated value 2025-05-07T20:32:43.1786541Z | D=5120, # or any other generated value 2025-05-07T20:32:43.1787024Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:43.1787534Z | contiguous=True, # or any other generated value 2025-05-07T20:32:43.1788043Z | compiled=True, # or any other generated value 2025-05-07T20:32:43.1788473Z | ) 2025-05-07T20:32:43.1788731Z | 2025-05-07T20:32:43.1789484Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:43.1790355Z +------------------------------------ 2025-05-07T20:32:43.1790916Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:43.1791427Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1792013Z self=, 2025-05-07T20:32:43.1792580Z T=1, 2025-05-07T20:32:43.1792842Z D=5120, 2025-05-07T20:32:43.1793125Z scale_ub=None, 2025-05-07T20:32:43.1793436Z contiguous=True, 2025-05-07T20:32:43.1793753Z compiled=True, 2025-05-07T20:32:43.1794054Z ) 2025-05-07T20:32:43.1794507Z self = 2025-05-07T20:32:43.1795188Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1795561Z 2025-05-07T20:32:43.1795673Z @given( 2025-05-07T20:32:43.1796000Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1796440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1796863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1797340Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1797797Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1798174Z ) 2025-05-07T20:32:43.1798645Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1799278Z def test_silu_mul_quant( 2025-05-07T20:32:43.1799628Z self, 2025-05-07T20:32:43.1799904Z T: int, 2025-05-07T20:32:43.1800191Z D: int, 2025-05-07T20:32:43.1800511Z scale_ub: Optional[float], 2025-05-07T20:32:43.1800896Z contiguous: bool, 2025-05-07T20:32:43.1801241Z compiled: bool, 2025-05-07T20:32:43.1801553Z ) -> None: 2025-05-07T20:32:43.1801855Z torch.manual_seed(2025) 2025-05-07T20:32:43.1802214Z 2025-05-07T20:32:43.1802614Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1803081Z 2025-05-07T20:32:43.1803350Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1803818Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1804261Z x = x_sign * x_clamp 2025-05-07T20:32:43.1804619Z x0 = x[:, :D] 2025-05-07T20:32:43.1804934Z x1 = x[:, D:] 2025-05-07T20:32:43.1805226Z 2025-05-07T20:32:43.1805494Z if contiguous: 2025-05-07T20:32:43.1805836Z x0 = x0.contiguous() 
2025-05-07T20:32:43.1806197Z x1 = x1.contiguous() 2025-05-07T20:32:43.1806546Z 2025-05-07T20:32:43.1806821Z if scale_ub is not None: 2025-05-07T20:32:43.1807231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1807695Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1808133Z ) 2025-05-07T20:32:43.1808401Z else: 2025-05-07T20:32:43.1830406Z scale_ub_tensor = None 2025-05-07T20:32:43.1830825Z 2025-05-07T20:32:43.1831155Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1831570Z op = silu_mul_quant 2025-05-07T20:32:43.1831909Z if compiled: 2025-05-07T20:32:43.1832238Z op = torch.compile(op) 2025-05-07T20:32:43.1832641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1833002Z 2025-05-07T20:32:43.1833262Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1833744Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1834186Z 2025-05-07T20:32:43.1834513Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1834966Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1835375Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1835836Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1836354Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1836807Z 2025-05-07T20:32:43.1837094Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.1837380Z 2025-05-07T20:32:43.1837526Z moe/activation_test.py:126: 2025-05-07T20:32:43.1838018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1838491Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1838954Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1840456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1841603Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1842380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1843331Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1844295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1845333Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1846349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.1847363Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1848356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1849241Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1850101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1850847Z fn() 2025-05-07T20:32:43.1851568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1852402Z self.fn.run( 2025-05-07T20:32:43.1853067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1853996Z kernel = self.compile( 2025-05-07T20:32:43.1855574Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1856500Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1857046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1857363Z 2025-05-07T20:32:43.1857654Z self = 2025-05-07T20:32:43.1859090Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1860958Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdcfd99d0>} 2025-05-07T20:32:43.1862884Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1864315Z context = 2025-05-07T20:32:43.1864820Z 2025-05-07T20:32:43.1865145Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1865895Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1866558Z module_map=module_map) 2025-05-07T20:32:43.1867056Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1867533Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1867895Z E ^ 2025-05-07T20:32:43.1868529Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1869235Z 2025-05-07T20:32:43.1869832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1870579Z 2025-05-07T20:32:43.1870734Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1871390Z self=, 2025-05-07T20:32:43.1871973Z T=2048, 2025-05-07T20:32:43.1872237Z D=5120, 2025-05-07T20:32:43.1872525Z scale_ub=1200.0, 2025-05-07T20:32:43.1872859Z contiguous=True, 2025-05-07T20:32:43.1873177Z compiled=False, 2025-05-07T20:32:43.1873483Z ) 2025-05-07T20:32:43.1873946Z self = 2025-05-07T20:32:43.1874639Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.1875032Z 2025-05-07T20:32:43.1875143Z @given( 2025-05-07T20:32:43.1875473Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1875932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1876373Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1876857Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1877340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1877761Z ) 2025-05-07T20:32:43.1878274Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1878901Z def test_silu_mul_quant( 2025-05-07T20:32:43.1879242Z self, 2025-05-07T20:32:43.1879520Z T: int, 2025-05-07T20:32:43.1879810Z D: int, 2025-05-07T20:32:43.1880119Z scale_ub: Optional[float], 2025-05-07T20:32:43.1880515Z contiguous: bool, 2025-05-07T20:32:43.1880853Z compiled: bool, 2025-05-07T20:32:43.1881169Z ) -> None: 2025-05-07T20:32:43.1881476Z torch.manual_seed(2025) 2025-05-07T20:32:43.1881811Z 2025-05-07T20:32:43.1882201Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1882734Z 2025-05-07T20:32:43.1883018Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1883434Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1883867Z x = x_sign * x_clamp 2025-05-07T20:32:43.1884219Z x0 = x[:, :D] 
2025-05-07T20:32:43.1884520Z x1 = x[:, D:] 2025-05-07T20:32:43.1884793Z 2025-05-07T20:32:43.1885061Z if contiguous: 2025-05-07T20:32:43.1885392Z x0 = x0.contiguous() 2025-05-07T20:32:43.1885759Z x1 = x1.contiguous() 2025-05-07T20:32:43.1886114Z 2025-05-07T20:32:43.1886405Z if scale_ub is not None: 2025-05-07T20:32:43.1886803Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1887290Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1887729Z ) 2025-05-07T20:32:43.1888014Z else: 2025-05-07T20:32:43.1888316Z scale_ub_tensor = None 2025-05-07T20:32:43.1888687Z 2025-05-07T20:32:43.1889021Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1889473Z op = silu_mul_quant 2025-05-07T20:32:43.1889829Z if compiled: 2025-05-07T20:32:43.1890184Z op = torch.compile(op) 2025-05-07T20:32:43.1890588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1891054Z 2025-05-07T20:32:43.1891344Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1891566Z 2025-05-07T20:32:43.1891690Z moe/activation_test.py:117: 2025-05-07T20:32:43.1892084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1892503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1892877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1893797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1894782Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1895517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1896507Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1897408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1898146Z kernel = self.compile( 2025-05-07T20:32:43.1898867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1899758Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1900313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1900624Z 2025-05-07T20:32:43.1900920Z self = 2025-05-07T20:32:43.1902594Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1904180Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdbfe5e50>} 2025-05-07T20:32:43.1905535Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1906556Z context = 2025-05-07T20:32:43.1906847Z 2025-05-07T20:32:43.1907015Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1907545Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1908013Z module_map=module_map) 2025-05-07T20:32:43.1908443Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1908808Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1909073Z E ^ 2025-05-07T20:32:43.1909541Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = 
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdbb7ca60>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
at 0x7fbfdbb7ca60>} 2025-05-07T20:32:43.1946580Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1947599Z context = 2025-05-07T20:32:43.1947889Z 2025-05-07T20:32:43.1948069Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1948602Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1949180Z module_map=module_map) 2025-05-07T20:32:43.1949558Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1949922Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1950187Z E ^ 2025-05-07T20:32:43.1950652Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1951102Z 2025-05-07T20:32:43.1951520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1952030Z 2025-05-07T20:32:43.1952141Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1952554Z self=, 2025-05-07T20:32:43.1952961Z T=16384, 2025-05-07T20:32:43.1953161Z D=7168, 2025-05-07T20:32:43.1953351Z scale_ub=1200.0, 2025-05-07T20:32:43.1953585Z contiguous=False, 2025-05-07T20:32:43.1953819Z compiled=False, 2025-05-07T20:32:43.1954032Z ) 2025-05-07T20:32:43.1954352Z self = 2025-05-07T20:32:43.1954860Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.1955138Z 2025-05-07T20:32:43.1955225Z @given( 2025-05-07T20:32:43.1955591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1955924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1956238Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1956569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1956910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1957209Z ) 2025-05-07T20:32:43.1957560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1958009Z def test_silu_mul_quant( 2025-05-07T20:32:43.1958262Z self, 2025-05-07T20:32:43.1958456Z T: int, 2025-05-07T20:32:43.1958734Z D: int, 2025-05-07T20:32:43.1958958Z scale_ub: Optional[float], 2025-05-07T20:32:43.1959228Z contiguous: bool, 2025-05-07T20:32:43.1959474Z compiled: bool, 2025-05-07T20:32:43.1959708Z ) -> None: 2025-05-07T20:32:43.1959931Z torch.manual_seed(2025) 2025-05-07T20:32:43.1960181Z 2025-05-07T20:32:43.1960464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1960814Z 2025-05-07T20:32:43.1961004Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1961355Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1961674Z x = x_sign * x_clamp 2025-05-07T20:32:43.1961915Z x0 = x[:, :D] 2025-05-07T20:32:43.1962141Z x1 = x[:, D:] 2025-05-07T20:32:43.1962352Z 2025-05-07T20:32:43.1962537Z if contiguous: 2025-05-07T20:32:43.1962778Z x0 = x0.contiguous() 2025-05-07T20:32:43.1963043Z x1 = x1.contiguous() 2025-05-07T20:32:43.1963290Z 2025-05-07T20:32:43.1963489Z if scale_ub is not None: 2025-05-07T20:32:43.1963772Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1964105Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1964424Z ) 2025-05-07T20:32:43.1964628Z else: 2025-05-07T20:32:43.1964847Z scale_ub_tensor = None 2025-05-07T20:32:43.1965102Z 2025-05-07T20:32:43.1965345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1965662Z op = silu_mul_quant 2025-05-07T20:32:43.1965914Z if compiled: 
2025-05-07T20:32:43.1966174Z op = torch.compile(op) 2025-05-07T20:32:43.1966479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1966757Z 2025-05-07T20:32:43.1966956Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1967122Z 2025-05-07T20:32:43.1967233Z moe/activation_test.py:117: 2025-05-07T20:32:43.1967584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1967928Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1968217Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1968912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1969614Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1970164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1970852Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1971511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1972055Z kernel = self.compile( 2025-05-07T20:32:43.1972605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1973269Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1973667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1973906Z 2025-05-07T20:32:43.1974119Z self = 2025-05-07T20:32:43.1975257Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1976670Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdb9e5670>} 2025-05-07T20:32:43.1978016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1979094Z context = 2025-05-07T20:32:43.1979392Z 2025-05-07T20:32:43.1979563Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1980097Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1980575Z module_map=module_map) 2025-05-07T20:32:43.1980997Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1981437Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1981703Z E ^ 2025-05-07T20:32:43.1982160Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1982616Z 2025-05-07T20:32:43.1983031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1983541Z 2025-05-07T20:32:43.1983652Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1984067Z self=, 2025-05-07T20:32:43.1984474Z T=1, 2025-05-07T20:32:43.1984666Z D=7168, 2025-05-07T20:32:43.1984865Z scale_ub=None, 2025-05-07T20:32:43.1985079Z contiguous=True, 2025-05-07T20:32:43.1985311Z compiled=True, 2025-05-07T20:32:43.1985525Z ) 2025-05-07T20:32:43.1985845Z self = 2025-05-07T20:32:43.1986330Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1986592Z 2025-05-07T20:32:43.1986677Z @given( 2025-05-07T20:32:43.1986907Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1987230Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1987546Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1987877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1988270Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1988573Z ) 2025-05-07T20:32:43.1988929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1989371Z def test_silu_mul_quant( 2025-05-07T20:32:43.1989618Z self, 2025-05-07T20:32:43.1989824Z T: int, 2025-05-07T20:32:43.1990026Z D: int, 2025-05-07T20:32:43.1990250Z scale_ub: Optional[float], 2025-05-07T20:32:43.1990533Z contiguous: bool, 2025-05-07T20:32:43.1990774Z compiled: bool, 2025-05-07T20:32:43.1991004Z ) -> None: 2025-05-07T20:32:43.1991226Z torch.manual_seed(2025) 2025-05-07T20:32:43.1991468Z 2025-05-07T20:32:43.1991747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1992100Z 2025-05-07T20:32:43.1992295Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1992596Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1992912Z x = x_sign * x_clamp 2025-05-07T20:32:43.1993158Z x0 = x[:, :D] 2025-05-07T20:32:43.1993377Z x1 = x[:, D:] 2025-05-07T20:32:43.1993588Z 2025-05-07T20:32:43.1993775Z if contiguous: 2025-05-07T20:32:43.1994004Z x0 = x0.contiguous() 2025-05-07T20:32:43.1994267Z x1 = x1.contiguous() 2025-05-07T20:32:43.1994568Z 2025-05-07T20:32:43.1994820Z if scale_ub is not None: 2025-05-07T20:32:43.1995102Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1995444Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1995751Z ) 2025-05-07T20:32:43.1995947Z else: 2025-05-07T20:32:43.1996162Z scale_ub_tensor = None 2025-05-07T20:32:43.1996412Z 2025-05-07T20:32:43.1996653Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1996969Z op = silu_mul_quant 2025-05-07T20:32:43.1997224Z if compiled: 2025-05-07T20:32:43.1997477Z op = torch.compile(op) 2025-05-07T20:32:43.1997837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1998116Z 2025-05-07T20:32:43.1998314Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1998602Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1998897Z 2025-05-07T20:32:43.1999136Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1999481Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1999782Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2008407Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2008786Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2009113Z 2025-05-07T20:32:43.2009328Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.2009529Z 2025-05-07T20:32:43.2009641Z moe/activation_test.py:126: 2025-05-07T20:32:43.2009937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2010297Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2010642Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2011429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2012209Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2012768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2013460Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2014147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2014871Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2015633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.2016472Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2017203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2017851Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2018461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2018978Z fn() 2025-05-07T20:32:43.2019498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2020091Z self.fn.run( 2025-05-07T20:32:43.2020563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2021176Z kernel = self.compile( 2025-05-07T20:32:43.2021734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2022392Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2022790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2023029Z 2025-05-07T20:32:43.2023288Z self = 2025-05-07T20:32:43.2024109Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2024629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fbfdba28dc0>} 2025-05-07T20:32:43.2025379Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2025642Z context = 2025-05-07T20:32:43.2025647Z 2025-05-07T20:32:43.2025817Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2026096Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2026207Z module_map=module_map) 2025-05-07T20:32:43.2026374Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2026488Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2026568Z E ^ 2025-05-07T20:32:43.2026928Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2026932Z 2025-05-07T20:32:43.2027355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2027362Z 2025-05-07T20:32:43.2027467Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2027706Z self=, 2025-05-07T20:32:43.2027785Z T=4096, 2025-05-07T20:32:43.2027872Z D=5120, 2025-05-07T20:32:43.2027958Z scale_ub=None, 2025-05-07T20:32:43.2028045Z contiguous=False, 2025-05-07T20:32:43.2028131Z compiled=False, 2025-05-07T20:32:43.2028203Z ) 2025-05-07T20:32:43.2028416Z self = 2025-05-07T20:32:43.2028591Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2028595Z 2025-05-07T20:32:43.2028671Z @given( 2025-05-07T20:32:43.2028791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2028896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2029011Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2029183Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2029304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2029381Z ) 2025-05-07T20:32:43.2029640Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2029740Z def test_silu_mul_quant( 2025-05-07T20:32:43.2029827Z self, 2025-05-07T20:32:43.2029914Z T: int, 2025-05-07T20:32:43.2029994Z D: int, 2025-05-07T20:32:43.2030094Z scale_ub: Optional[float], 2025-05-07T20:32:43.2030192Z contiguous: bool, 2025-05-07T20:32:43.2030281Z compiled: bool, 2025-05-07T20:32:43.2030369Z ) -> None: 2025-05-07T20:32:43.2030475Z torch.manual_seed(2025) 2025-05-07T20:32:43.2030550Z 2025-05-07T20:32:43.2030728Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2030805Z 2025-05-07T20:32:43.2030899Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2031040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2031134Z x = x_sign * x_clamp 2025-05-07T20:32:43.2031220Z x0 = x[:, :D] 2025-05-07T20:32:43.2031310Z x1 = x[:, D:] 2025-05-07T20:32:43.2031387Z 2025-05-07T20:32:43.2031472Z if contiguous: 2025-05-07T20:32:43.2031576Z x0 = x0.contiguous() 2025-05-07T20:32:43.2031748Z x1 = x1.contiguous() 2025-05-07T20:32:43.2031824Z 2025-05-07T20:32:43.2031924Z if scale_ub is not None: 2025-05-07T20:32:43.2032031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2032169Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2032256Z ) 2025-05-07T20:32:43.2032336Z else: 2025-05-07T20:32:43.2032445Z scale_ub_tensor = None 2025-05-07T20:32:43.2032520Z 2025-05-07T20:32:43.2032650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2032752Z op = silu_mul_quant 2025-05-07T20:32:43.2032883Z if compiled: 
2025-05-07T20:32:43.2032991Z op = torch.compile(op) 2025-05-07T20:32:43.2033112Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2033188Z 2025-05-07T20:32:43.2033283Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2033288Z 2025-05-07T20:32:43.2033395Z moe/activation_test.py:117: 2025-05-07T20:32:43.2033533Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2033644Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2033747Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2034253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2034360Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2034725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2034956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2035309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2035406Z kernel = self.compile( 2025-05-07T20:32:43.2035795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2035978Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2036106Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2036110Z 2025-05-07T20:32:43.2036327Z self = 2025-05-07T20:32:43.2037095Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2037659Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdb6393a0>} 2025-05-07T20:32:43.2038415Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2038614Z context = 2025-05-07T20:32:43.2038625Z 2025-05-07T20:32:43.2038795Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2039058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2039173Z module_map=module_map) 2025-05-07T20:32:43.2039341Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2039443Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2039535Z E ^ 2025-05-07T20:32:43.2039893Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2039898Z 2025-05-07T20:32:43.2040591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2040740Z 2025-05-07T20:32:43.2040854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2041081Z self=, 2025-05-07T20:32:43.2041168Z T=4096, 2025-05-07T20:32:43.2041244Z D=7168, 2025-05-07T20:32:43.2041327Z scale_ub=None, 2025-05-07T20:32:43.2041420Z contiguous=False, 2025-05-07T20:32:43.2041506Z compiled=False, 2025-05-07T20:32:43.2041581Z ) 2025-05-07T20:32:43.2041805Z self = 2025-05-07T20:32:43.2041981Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2042050Z 2025-05-07T20:32:43.2042137Z @given( 2025-05-07T20:32:43.2042264Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2042365Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2042489Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2042614Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2042733Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2042817Z ) 2025-05-07T20:32:43.2043066Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2043171Z def test_silu_mul_quant( 2025-05-07T20:32:43.2043249Z self, 2025-05-07T20:32:43.2043328Z T: int, 2025-05-07T20:32:43.2043412Z D: int, 2025-05-07T20:32:43.2043511Z scale_ub: Optional[float], 2025-05-07T20:32:43.2043602Z contiguous: bool, 2025-05-07T20:32:43.2043696Z compiled: bool, 2025-05-07T20:32:43.2043778Z ) -> None: 2025-05-07T20:32:43.2043882Z torch.manual_seed(2025) 2025-05-07T20:32:43.2043968Z 2025-05-07T20:32:43.2044141Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2044219Z 2025-05-07T20:32:43.2044320Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2044451Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2044545Z x = x_sign * x_clamp 2025-05-07T20:32:43.2044634Z x0 = x[:, :D] 2025-05-07T20:32:43.2044719Z x1 = x[:, D:] 2025-05-07T20:32:43.2044804Z 2025-05-07T20:32:43.2044890Z if contiguous: 2025-05-07T20:32:43.2044984Z x0 = x0.contiguous() 2025-05-07T20:32:43.2045082Z x1 = x1.contiguous() 2025-05-07T20:32:43.2045156Z 2025-05-07T20:32:43.2045249Z if scale_ub is not None: 2025-05-07T20:32:43.2045364Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2045502Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2045645Z ) 2025-05-07T20:32:43.2045734Z else: 2025-05-07T20:32:43.2045830Z scale_ub_tensor = None 2025-05-07T20:32:43.2045905Z 2025-05-07T20:32:43.2046046Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2046140Z op = silu_mul_quant 2025-05-07T20:32:43.2046240Z if compiled: 2025-05-07T20:32:43.2046343Z op = torch.compile(op) 2025-05-07T20:32:43.2046450Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2046535Z 2025-05-07T20:32:43.2046627Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2046631Z 2025-05-07T20:32:43.2046732Z moe/activation_test.py:117: 2025-05-07T20:32:43.2046867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2046974Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2047076Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2047583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2047685Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2048049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2048343Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2048725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2048829Z kernel = self.compile( 2025-05-07T20:32:43.2049205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2049386Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2049513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2049517Z 2025-05-07T20:32:43.2049731Z self = 2025-05-07T20:32:43.2050536Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2051056Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdb5a78b0>} 2025-05-07T20:32:43.2051812Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2052007Z context = 2025-05-07T20:32:43.2052012Z 2025-05-07T20:32:43.2052188Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2052461Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2052571Z module_map=module_map) 2025-05-07T20:32:43.2052740Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2052839Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2052928Z E ^ 2025-05-07T20:32:43.2053278Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2053283Z 2025-05-07T20:32:43.2053698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2053702Z 2025-05-07T20:32:43.2053811Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2054034Z self=, 2025-05-07T20:32:43.2054116Z T=128, 2025-05-07T20:32:43.2054193Z D=7168, 2025-05-07T20:32:43.2054276Z scale_ub=None, 2025-05-07T20:32:43.2054415Z contiguous=False, 2025-05-07T20:32:43.2054503Z compiled=True, 2025-05-07T20:32:43.2054580Z ) 2025-05-07T20:32:43.2054801Z self = 2025-05-07T20:32:43.2054971Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2054982Z 2025-05-07T20:32:43.2055059Z @given( 2025-05-07T20:32:43.2055183Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2055282Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2055406Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2055523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2055640Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2055724Z ) 2025-05-07T20:32:43.2055971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2056067Z def test_silu_mul_quant( 2025-05-07T20:32:43.2056155Z self, 2025-05-07T20:32:43.2056232Z T: int, 2025-05-07T20:32:43.2056308Z D: int, 2025-05-07T20:32:43.2056415Z scale_ub: Optional[float], 2025-05-07T20:32:43.2056504Z contiguous: bool, 2025-05-07T20:32:43.2056591Z compiled: bool, 2025-05-07T20:32:43.2056676Z ) -> None: 2025-05-07T20:32:43.2056851Z torch.manual_seed(2025) 2025-05-07T20:32:43.2056930Z 2025-05-07T20:32:43.2057098Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2057172Z 2025-05-07T20:32:43.2057270Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2057396Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2057487Z x = x_sign * x_clamp 2025-05-07T20:32:43.2057577Z x0 = x[:, :D] 2025-05-07T20:32:43.2057657Z x1 = x[:, D:] 2025-05-07T20:32:43.2057731Z 2025-05-07T20:32:43.2057821Z if contiguous: 2025-05-07T20:32:43.2057912Z x0 = x0.contiguous() 2025-05-07T20:32:43.2058049Z x1 = x1.contiguous() 2025-05-07T20:32:43.2058127Z 2025-05-07T20:32:43.2058219Z if scale_ub is not None: 2025-05-07T20:32:43.2058326Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2058468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2058549Z ) 2025-05-07T20:32:43.2058634Z else: 2025-05-07T20:32:43.2058728Z scale_ub_tensor = None 2025-05-07T20:32:43.2058800Z 2025-05-07T20:32:43.2058937Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2059028Z op = silu_mul_quant 2025-05-07T20:32:43.2059115Z if compiled: 2025-05-07T20:32:43.2059222Z op = torch.compile(op) 2025-05-07T20:32:43.2059329Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2059402Z 2025-05-07T20:32:43.2059500Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2059620Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2059700Z 2025-05-07T20:32:43.2059841Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2059945Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2060050Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2060169Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2060315Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2060398Z 2025-05-07T20:32:43.2060497Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.2060501Z 2025-05-07T20:32:43.2060623Z moe/activation_test.py:126: 2025-05-07T20:32:43.2060772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2060888Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2061029Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2061685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2061791Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2062155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2062382Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2062753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2063015Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2063416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.2063674Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2064046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2064214Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2064557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2064634Z fn() 2025-05-07T20:32:43.2065110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2065194Z self.fn.run( 2025-05-07T20:32:43.2065528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2065628Z kernel = self.compile( 2025-05-07T20:32:43.2066010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2066188Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2066325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2066368Z 2025-05-07T20:32:43.2066576Z self = 2025-05-07T20:32:43.2067352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2067858Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fbfdb5a7e50>} 2025-05-07T20:32:43.2068600Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2068795Z context = 2025-05-07T20:32:43.2068799Z 2025-05-07T20:32:43.2068973Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2069246Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2069355Z module_map=module_map) 2025-05-07T20:32:43.2069521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2069637Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2069715Z E ^ 2025-05-07T20:32:43.2070070Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2070075Z 2025-05-07T20:32:43.2070484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2070489Z 2025-05-07T20:32:43.2070590Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2070817Z self=, 2025-05-07T20:32:43.2070940Z T=128, 2025-05-07T20:32:43.2071025Z D=7168, 2025-05-07T20:32:43.2071108Z scale_ub=None, 2025-05-07T20:32:43.2071195Z contiguous=False, 2025-05-07T20:32:43.2071284Z compiled=False, 2025-05-07T20:32:43.2071357Z ) 2025-05-07T20:32:43.2071573Z self = 2025-05-07T20:32:43.2071754Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2071758Z 2025-05-07T20:32:43.2071835Z @given( 2025-05-07T20:32:43.2071956Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2072060Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2072174Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2072293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2072408Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2072484Z ) 2025-05-07T20:32:43.2072737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2072835Z def test_silu_mul_quant( 2025-05-07T20:32:43.2072913Z self, 2025-05-07T20:32:43.2072996Z T: int, 2025-05-07T20:32:43.2073073Z D: int, 2025-05-07T20:32:43.2073171Z scale_ub: Optional[float], 2025-05-07T20:32:43.2073268Z contiguous: bool, 2025-05-07T20:32:43.2073435Z compiled: bool, 2025-05-07T20:32:43.2073516Z ) -> None: 2025-05-07T20:32:43.2073617Z torch.manual_seed(2025) 2025-05-07T20:32:43.2073691Z 2025-05-07T20:32:43.2073863Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2073937Z 2025-05-07T20:32:43.2074027Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2074158Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2074248Z x = x_sign * x_clamp 2025-05-07T20:32:43.2074329Z x0 = x[:, :D] 2025-05-07T20:32:43.2074418Z x1 = x[:, D:] 2025-05-07T20:32:43.2074494Z 2025-05-07T20:32:43.2074633Z if contiguous: 2025-05-07T20:32:43.2074739Z x0 = x0.contiguous() 2025-05-07T20:32:43.2074831Z x1 = x1.contiguous() 2025-05-07T20:32:43.2074902Z 2025-05-07T20:32:43.2074998Z if scale_ub is not None: 2025-05-07T20:32:43.2075104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2075249Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2075331Z ) 2025-05-07T20:32:43.2075411Z else: 2025-05-07T20:32:43.2075505Z scale_ub_tensor = None 2025-05-07T20:32:43.2075583Z 2025-05-07T20:32:43.2075716Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2075806Z op = silu_mul_quant 2025-05-07T20:32:43.2075897Z if compiled: 
2025-05-07T20:32:43.2075998Z op = torch.compile(op) 2025-05-07T20:32:43.2076112Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2076185Z 2025-05-07T20:32:43.2076283Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2076288Z 2025-05-07T20:32:43.2076393Z moe/activation_test.py:117: 2025-05-07T20:32:43.2076521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2076623Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2076729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2077238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2077336Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2077694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2077923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2078268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2078362Z kernel = self.compile( 2025-05-07T20:32:43.2078816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2079003Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2079131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2079141Z 2025-05-07T20:32:43.2079353Z self = 2025-05-07T20:32:43.2080122Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2080625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdb08faf0>} 2025-05-07T20:32:43.2081381Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2081578Z context = 2025-05-07T20:32:43.2081582Z 2025-05-07T20:32:43.2081845Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2082117Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2082223Z module_map=module_map) 2025-05-07T20:32:43.2082388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2082485Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2082565Z E ^ 2025-05-07T20:32:43.2082919Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2082924Z 2025-05-07T20:32:43.2083381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2083386Z 2025-05-07T20:32:43.2083498Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2083718Z self=, 2025-05-07T20:32:43.2083807Z T=4096, 2025-05-07T20:32:43.2083885Z D=5120, 2025-05-07T20:32:43.2083968Z scale_ub=1200.0, 2025-05-07T20:32:43.2084058Z contiguous=True, 2025-05-07T20:32:43.2084144Z compiled=False, 2025-05-07T20:32:43.2084218Z ) 2025-05-07T20:32:43.2084437Z self = 2025-05-07T20:32:43.2084612Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2084616Z 2025-05-07T20:32:43.2084692Z @given( 2025-05-07T20:32:43.2084815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2084914Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2085034Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2085158Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2085273Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2085354Z ) 2025-05-07T20:32:43.2085602Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2085699Z def test_silu_mul_quant( 2025-05-07T20:32:43.2085780Z self, 2025-05-07T20:32:43.2085856Z T: int, 2025-05-07T20:32:43.2085935Z D: int, 2025-05-07T20:32:43.2086036Z scale_ub: Optional[float], 2025-05-07T20:32:43.2086124Z contiguous: bool, 2025-05-07T20:32:43.2086210Z compiled: bool, 2025-05-07T20:32:43.2086294Z ) -> None: 2025-05-07T20:32:43.2086390Z torch.manual_seed(2025) 2025-05-07T20:32:43.2086465Z 2025-05-07T20:32:43.2086637Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2086711Z 2025-05-07T20:32:43.2086857Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2086983Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2087073Z x = x_sign * x_clamp 2025-05-07T20:32:43.2087157Z x0 = x[:, :D] 2025-05-07T20:32:43.2087237Z x1 = x[:, D:] 2025-05-07T20:32:43.2087313Z 2025-05-07T20:32:43.2087402Z if contiguous: 2025-05-07T20:32:43.2087493Z x0 = x0.contiguous() 2025-05-07T20:32:43.2087580Z x1 = x1.contiguous() 2025-05-07T20:32:43.2087654Z 2025-05-07T20:32:43.2087743Z if scale_ub is not None: 2025-05-07T20:32:43.2087848Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2087985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2088060Z ) 2025-05-07T20:32:43.2088142Z else: 2025-05-07T20:32:43.2088234Z scale_ub_tensor = None 2025-05-07T20:32:43.2088307Z 2025-05-07T20:32:43.2088446Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2088543Z op = silu_mul_quant 2025-05-07T20:32:43.2088628Z if compiled: 2025-05-07T20:32:43.2088734Z op = torch.compile(op) 2025-05-07T20:32:43.2088840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2088915Z 2025-05-07T20:32:43.2089092Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2089097Z 2025-05-07T20:32:43.2089198Z moe/activation_test.py:117: 2025-05-07T20:32:43.2089330Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2089430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2089528Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2090027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2090123Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2090481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2090760Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2091141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2091238Z kernel = self.compile( 2025-05-07T20:32:43.2091620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2091798Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2091927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2091931Z 2025-05-07T20:32:43.2092138Z self = 2025-05-07T20:32:43.2092908Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2093419Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdb165ca0>} 2025-05-07T20:32:43.2094156Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2094355Z context = 2025-05-07T20:32:43.2094360Z 2025-05-07T20:32:43.2094527Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2094798Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2094906Z module_map=module_map) 2025-05-07T20:32:43.2095111Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2095220Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2095298Z E ^ 2025-05-07T20:32:43.2095654Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2095664Z 2025-05-07T20:32:43.2096084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2096089Z 2025-05-07T20:32:43.2096194Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2096421Z self=, 2025-05-07T20:32:43.2096498Z T=1, 2025-05-07T20:32:43.2096574Z D=5120, 2025-05-07T20:32:43.2096662Z scale_ub=None, 2025-05-07T20:32:43.2096747Z contiguous=True, 2025-05-07T20:32:43.2096829Z compiled=True, 2025-05-07T20:32:43.2096906Z ) 2025-05-07T20:32:43.2097125Z self = 2025-05-07T20:32:43.2097296Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2097301Z 2025-05-07T20:32:43.2097378Z @given( 2025-05-07T20:32:43.2097497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2097606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2097800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2097919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2098040Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2098114Z ) 2025-05-07T20:32:43.2098362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2098460Z def test_silu_mul_quant( 2025-05-07T20:32:43.2098537Z self, 2025-05-07T20:32:43.2098620Z T: int, 2025-05-07T20:32:43.2098697Z D: int, 2025-05-07T20:32:43.2098796Z scale_ub: Optional[float], 2025-05-07T20:32:43.2098889Z contiguous: bool, 2025-05-07T20:32:43.2099020Z compiled: bool, 2025-05-07T20:32:43.2099100Z ) -> None: 2025-05-07T20:32:43.2099199Z torch.manual_seed(2025) 2025-05-07T20:32:43.2099271Z 2025-05-07T20:32:43.2099439Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2099515Z 2025-05-07T20:32:43.2099616Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2099740Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2099836Z x = x_sign * x_clamp 2025-05-07T20:32:43.2099916Z x0 = x[:, :D] 2025-05-07T20:32:43.2100000Z x1 = x[:, D:] 2025-05-07T20:32:43.2100076Z 2025-05-07T20:32:43.2100159Z if contiguous: 2025-05-07T20:32:43.2100255Z x0 = x0.contiguous() 2025-05-07T20:32:43.2100345Z x1 = x1.contiguous() 2025-05-07T20:32:43.2100418Z 2025-05-07T20:32:43.2100515Z if scale_ub is not None: 2025-05-07T20:32:43.2100619Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2100762Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2100845Z ) 2025-05-07T20:32:43.2100921Z else: 2025-05-07T20:32:43.2101014Z scale_ub_tensor = None 2025-05-07T20:32:43.2101147Z 2025-05-07T20:32:43.2101276Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2101372Z op = silu_mul_quant 2025-05-07T20:32:43.2101463Z if compiled: 2025-05-07T20:32:43.2101563Z op = torch.compile(op) 2025-05-07T20:32:43.2101674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2101747Z 2025-05-07T20:32:43.2101838Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2101964Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2102038Z 2025-05-07T20:32:43.2102176Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2102285Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2102427Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2102555Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2102698Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2102773Z 2025-05-07T20:32:43.2102878Z > y_fp8_ref, 
y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fbfda9b6550>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[traceback identical to the one above: ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid] -> Triton autotuner -> compile -> make_ir]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
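Every failure in this stretch has the same root cause: both Triton kernels involved (`_kernel_quantize_fp8_row` from the reference quantizer and `_fbgemm_silu_mul_quant` from the op under test) emit the `fp8e4nv` dtype, Triton's name for the E4M3 format behind `torch.float8_e4m3fn`, and Triton of this vintage lowers that dtype only on NVIDIA GPUs with compute capability 8.9 (Ada) or newer; older architectures expose only `fp8e4b15` and `fp8e5`, which is exactly what the ValueError reports. A minimal sketch of a capability gate follows, assuming the >= 8.9 requirement above; `skip_if_no_fp8e4nv` is a hypothetical helper, not something defined in `activation_test.py`:

import unittest

import torch


def fp8e4nv_supported() -> bool:
    # Assumption: Triton compiles fp8e4nv (E4M3) kernels only on NVIDIA
    # GPUs with compute capability >= 8.9; anything older fails with the
    # exact ValueError captured in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical decorator: applied to test_silu_mul_quant, it would skip
# the test on older GPUs instead of erroring through every Hypothesis draw.
skip_if_no_fp8e4nv = unittest.skipIf(
    not fp8e4nv_supported(),
    "Triton fp8e4nv (torch.float8_e4m3fn) requires compute capability >= 8.9",
)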
2025-05-07T20:32:43.2134471Z op = silu_mul_quant 2025-05-07T20:32:43.2134555Z if compiled: 2025-05-07T20:32:43.2134653Z op = torch.compile(op) 2025-05-07T20:32:43.2134763Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2134839Z 2025-05-07T20:32:43.2134937Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2135059Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2135138Z 2025-05-07T20:32:43.2135326Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2135432Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2135534Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2135665Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2140385Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2140520Z 2025-05-07T20:32:43.2140675Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.2140681Z 2025-05-07T20:32:43.2140787Z moe/activation_test.py:126: 2025-05-07T20:32:43.2140918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2141031Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2141235Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2141814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2141923Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2142283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2142515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2143093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2143356Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2143755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.2144008Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2144390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2144624Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2144971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2145052Z fn() 2025-05-07T20:32:43.2145457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2145547Z self.fn.run( 2025-05-07T20:32:43.2145879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2145974Z kernel = self.compile( 2025-05-07T20:32:43.2146352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2146531Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2146657Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2146670Z 2025-05-07T20:32:43.2146882Z self = 2025-05-07T20:32:43.2147666Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2148181Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdac8fb80>} 2025-05-07T20:32:43.2148930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2149128Z context = 2025-05-07T20:32:43.2149133Z 2025-05-07T20:32:43.2149362Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2149636Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2149749Z module_map=module_map) 2025-05-07T20:32:43.2149910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2150016Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2150100Z E ^ 2025-05-07T20:32:43.2150456Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2150461Z 2025-05-07T20:32:43.2150875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2150879Z 2025-05-07T20:32:43.2150982Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2151204Z self=, 2025-05-07T20:32:43.2151287Z T=4096, 2025-05-07T20:32:43.2151370Z D=5120, 2025-05-07T20:32:43.2151460Z scale_ub=None, 2025-05-07T20:32:43.2151544Z contiguous=True, 2025-05-07T20:32:43.2151625Z compiled=True, 2025-05-07T20:32:43.2151704Z ) 2025-05-07T20:32:43.2151919Z self = 2025-05-07T20:32:43.2152173Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2152179Z 2025-05-07T20:32:43.2152261Z @given( 2025-05-07T20:32:43.2152380Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2152479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2152605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2152720Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2152839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2152913Z ) 2025-05-07T20:32:43.2153160Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2153301Z def test_silu_mul_quant( 2025-05-07T20:32:43.2153378Z self, 2025-05-07T20:32:43.2153455Z T: int, 2025-05-07T20:32:43.2153539Z D: int, 2025-05-07T20:32:43.2153634Z scale_ub: Optional[float], 2025-05-07T20:32:43.2153724Z contiguous: bool, 2025-05-07T20:32:43.2153821Z compiled: bool, 2025-05-07T20:32:43.2153901Z ) -> None: 2025-05-07T20:32:43.2153994Z torch.manual_seed(2025) 2025-05-07T20:32:43.2154070Z 2025-05-07T20:32:43.2154240Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2154315Z 2025-05-07T20:32:43.2154408Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2154533Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2154629Z x = x_sign * x_clamp 2025-05-07T20:32:43.2154710Z x0 = x[:, :D] 2025-05-07T20:32:43.2154788Z x1 = x[:, D:] 2025-05-07T20:32:43.2154861Z 2025-05-07T20:32:43.2154943Z if contiguous: 2025-05-07T20:32:43.2155038Z x0 = x0.contiguous() 2025-05-07T20:32:43.2155136Z x1 = x1.contiguous() 2025-05-07T20:32:43.2155207Z 2025-05-07T20:32:43.2155296Z if scale_ub is not None: 2025-05-07T20:32:43.2155403Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2155542Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2155618Z ) 2025-05-07T20:32:43.2155702Z else: 2025-05-07T20:32:43.2155796Z scale_ub_tensor 
= None 2025-05-07T20:32:43.2155876Z 2025-05-07T20:32:43.2156005Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2156097Z op = silu_mul_quant 2025-05-07T20:32:43.2156189Z if compiled: 2025-05-07T20:32:43.2156288Z op = torch.compile(op) 2025-05-07T20:32:43.2156394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2156470Z 2025-05-07T20:32:43.2156559Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2156730Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2156808Z 2025-05-07T20:32:43.2156943Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2157045Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2157150Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2157277Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2157419Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2157491Z 2025-05-07T20:32:43.2157591Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.2157595Z 2025-05-07T20:32:43.2157698Z moe/activation_test.py:126: 2025-05-07T20:32:43.2157828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2157932Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2158072Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2158632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2158738Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2159101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2159406Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2159781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2160039Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2160437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.2160709Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2161160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2161329Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2161673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2161754Z fn() 2025-05-07T20:32:43.2162156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2162241Z self.fn.run( 2025-05-07T20:32:43.2162576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2162670Z kernel = self.compile( 2025-05-07T20:32:43.2163052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2163234Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2163368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2163373Z 2025-05-07T20:32:43.2163583Z self = 2025-05-07T20:32:43.2164357Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2164879Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfda81cca0>} 2025-05-07T20:32:43.2165621Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2165815Z context = 2025-05-07T20:32:43.2165860Z 2025-05-07T20:32:43.2166035Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2166301Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2166407Z module_map=module_map) 2025-05-07T20:32:43.2166581Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2166684Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2166762Z E ^ 2025-05-07T20:32:43.2167129Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2167133Z 2025-05-07T20:32:43.2167547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2167551Z 2025-05-07T20:32:43.2167658Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2167883Z self=, 2025-05-07T20:32:43.2167963Z T=16384, 2025-05-07T20:32:43.2168044Z D=5120, 2025-05-07T20:32:43.2168126Z scale_ub=None, 2025-05-07T20:32:43.2168211Z contiguous=True, 2025-05-07T20:32:43.2168302Z compiled=True, 2025-05-07T20:32:43.2168374Z ) 2025-05-07T20:32:43.2168668Z self = 2025-05-07T20:32:43.2168844Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2168848Z 2025-05-07T20:32:43.2168924Z @given( 2025-05-07T20:32:43.2169050Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2169151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2169267Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2169390Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2169502Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2169583Z ) 2025-05-07T20:32:43.2169874Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2169968Z def test_silu_mul_quant( 2025-05-07T20:32:43.2170048Z self, 2025-05-07T20:32:43.2170125Z T: int, 2025-05-07T20:32:43.2170203Z D: int, 2025-05-07T20:32:43.2170310Z scale_ub: Optional[float], 2025-05-07T20:32:43.2170399Z contiguous: bool, 2025-05-07T20:32:43.2170486Z compiled: bool, 2025-05-07T20:32:43.2170572Z ) -> None: 2025-05-07T20:32:43.2170664Z torch.manual_seed(2025) 2025-05-07T20:32:43.2170736Z 2025-05-07T20:32:43.2170909Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2170982Z 2025-05-07T20:32:43.2171073Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2171203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2171292Z x = x_sign * x_clamp 2025-05-07T20:32:43.2171377Z x0 = x[:, :D] 2025-05-07T20:32:43.2171463Z x1 = x[:, D:] 2025-05-07T20:32:43.2171535Z 2025-05-07T20:32:43.2171625Z if contiguous: 2025-05-07T20:32:43.2171716Z x0 = x0.contiguous() 2025-05-07T20:32:43.2171803Z x1 = x1.contiguous() 2025-05-07T20:32:43.2171878Z 2025-05-07T20:32:43.2171971Z if scale_ub is not None: 2025-05-07T20:32:43.2172081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2172221Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:43.2172297Z ) 2025-05-07T20:32:43.2172375Z else: 2025-05-07T20:32:43.2172475Z scale_ub_tensor = None 2025-05-07T20:32:43.2172550Z 2025-05-07T20:32:43.2172685Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2172778Z op = silu_mul_quant 2025-05-07T20:32:43.2172865Z if compiled: 2025-05-07T20:32:43.2172969Z op = torch.compile(op) 2025-05-07T20:32:43.2173078Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2173197Z 2025-05-07T20:32:43.2173293Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2173415Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2173486Z 2025-05-07T20:32:43.2173625Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2173734Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2173834Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2173959Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2174098Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2174177Z 2025-05-07T20:32:43.2174276Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.2174281Z 2025-05-07T20:32:43.2174379Z moe/activation_test.py:126: 2025-05-07T20:32:43.2174516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2174620Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2174760Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2175316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2175417Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2175888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2176117Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2176485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2176742Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2177140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.2177393Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2177809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2177978Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2178320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2178398Z fn() 2025-05-07T20:32:43.2178799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2178881Z self.fn.run( 2025-05-07T20:32:43.2179225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2179319Z kernel = self.compile( 2025-05-07T20:32:43.2179700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2179890Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2180017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:43.2180022Z 2025-05-07T20:32:43.2180235Z self = 2025-05-07T20:32:43.2181024Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2181582Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfdaddcb80>} 2025-05-07T20:32:43.2182374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2182570Z context = 2025-05-07T20:32:43.2182575Z 2025-05-07T20:32:43.2182745Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2183009Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2183127Z module_map=module_map) 2025-05-07T20:32:43.2183287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2183391Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2183472Z E ^ 2025-05-07T20:32:43.2183830Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2183834Z 2025-05-07T20:32:43.2184250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2184254Z 2025-05-07T20:32:43.2184366Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2184591Z self=, 2025-05-07T20:32:43.2184673Z T=1, 2025-05-07T20:32:43.2184749Z D=5120, 2025-05-07T20:32:43.2184832Z scale_ub=1200.0, 2025-05-07T20:32:43.2184918Z contiguous=True, 2025-05-07T20:32:43.2185078Z compiled=True, 2025-05-07T20:32:43.2185151Z ) 2025-05-07T20:32:43.2185372Z self = 2025-05-07T20:32:43.2185537Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2185542Z 2025-05-07T20:32:43.2185620Z @given( 2025-05-07T20:32:43.2185741Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2185840Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2185960Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2186077Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2186238Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2186317Z ) 2025-05-07T20:32:43.2186566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2186659Z def test_silu_mul_quant( 2025-05-07T20:32:43.2186739Z self, 2025-05-07T20:32:43.2186818Z T: int, 2025-05-07T20:32:43.2186899Z D: int, 2025-05-07T20:32:43.2187002Z scale_ub: Optional[float], 2025-05-07T20:32:43.2187090Z contiguous: bool, 2025-05-07T20:32:43.2187174Z compiled: bool, 2025-05-07T20:32:43.2187259Z ) -> None: 2025-05-07T20:32:43.2187351Z torch.manual_seed(2025) 2025-05-07T20:32:43.2187428Z 2025-05-07T20:32:43.2187593Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2187665Z 2025-05-07T20:32:43.2187760Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2187882Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2187977Z x = x_sign * x_clamp 2025-05-07T20:32:43.2188062Z x0 = x[:, :D] 2025-05-07T20:32:43.2188143Z x1 = x[:, D:] 2025-05-07T20:32:43.2188215Z 2025-05-07T20:32:43.2188302Z if contiguous: 2025-05-07T20:32:43.2188393Z x0 = x0.contiguous() 2025-05-07T20:32:43.2188482Z x1 = x1.contiguous() 2025-05-07T20:32:43.2188566Z 2025-05-07T20:32:43.2188657Z if scale_ub is not None: 2025-05-07T20:32:43.2188763Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:43.2188901Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2188976Z ) 2025-05-07T20:32:43.2189056Z else: 2025-05-07T20:32:43.2189148Z scale_ub_tensor = None 2025-05-07T20:32:43.2189220Z 2025-05-07T20:32:43.2189351Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2189440Z op = silu_mul_quant 2025-05-07T20:32:43.2189526Z if compiled: 2025-05-07T20:32:43.2189674Z op = torch.compile(op) 2025-05-07T20:32:43.2189783Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2189858Z 2025-05-07T20:32:43.2189954Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2189958Z 2025-05-07T20:32:43.2190055Z moe/activation_test.py:117: 2025-05-07T20:32:43.2190186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2190291Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2190390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2190756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2190847Z return fn(*args, **kwargs) 2025-05-07T20:32:43.2191337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2191436Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2191801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2192034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2192370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2192541Z kernel = self.compile( 2025-05-07T20:32:43.2192929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2193105Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2193232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2193240Z 2025-05-07T20:32:43.2193446Z self = 2025-05-07T20:32:43.2194216Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2194770Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfda7eda60>} 2025-05-07T20:32:43.2195526Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2195724Z context = 2025-05-07T20:32:43.2195729Z 2025-05-07T20:32:43.2195894Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2196159Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2196270Z module_map=module_map) 2025-05-07T20:32:43.2196436Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2196542Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2196617Z E ^ 2025-05-07T20:32:43.2196974Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2196981Z 2025-05-07T20:32:43.2197405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2197409Z 2025-05-07T20:32:43.2197509Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2197729Z self=, 2025-05-07T20:32:43.2197812Z T=1, 2025-05-07T20:32:43.2197887Z D=5120, 2025-05-07T20:32:43.2197972Z scale_ub=None, 2025-05-07T20:32:43.2198056Z contiguous=False, 2025-05-07T20:32:43.2198137Z compiled=True, 2025-05-07T20:32:43.2198210Z ) 2025-05-07T20:32:43.2198464Z self = 2025-05-07T20:32:43.2198638Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2198642Z 2025-05-07T20:32:43.2198724Z @given( 2025-05-07T20:32:43.2198841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2198940Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2199062Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2199179Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2199296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2199368Z ) 2025-05-07T20:32:43.2199616Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2199714Z def test_silu_mul_quant( 2025-05-07T20:32:43.2199790Z self, 2025-05-07T20:32:43.2199866Z T: int, 2025-05-07T20:32:43.2199945Z D: int, 2025-05-07T20:32:43.2200042Z scale_ub: Optional[float], 2025-05-07T20:32:43.2200137Z contiguous: bool, 2025-05-07T20:32:43.2200224Z compiled: bool, 2025-05-07T20:32:43.2200302Z ) -> None: 2025-05-07T20:32:43.2200397Z torch.manual_seed(2025) 2025-05-07T20:32:43.2200474Z 2025-05-07T20:32:43.2200640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2200758Z 2025-05-07T20:32:43.2200885Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2201012Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2201106Z x = x_sign * x_clamp 2025-05-07T20:32:43.2201186Z x0 = x[:, :D] 2025-05-07T20:32:43.2201264Z x1 = x[:, D:] 2025-05-07T20:32:43.2201339Z 2025-05-07T20:32:43.2201420Z if contiguous: 2025-05-07T20:32:43.2201510Z x0 = x0.contiguous() 2025-05-07T20:32:43.2201601Z x1 = x1.contiguous() 2025-05-07T20:32:43.2201673Z 2025-05-07T20:32:43.2201763Z if scale_ub is not None: 2025-05-07T20:32:43.2201870Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2202053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2202134Z ) 2025-05-07T20:32:43.2202211Z else: 2025-05-07T20:32:43.2202307Z scale_ub_tensor = None 2025-05-07T20:32:43.2202379Z 2025-05-07T20:32:43.2202512Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2202604Z op = silu_mul_quant 2025-05-07T20:32:43.2202692Z if compiled: 2025-05-07T20:32:43.2202796Z op = torch.compile(op) 2025-05-07T20:32:43.2202900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2202977Z 2025-05-07T20:32:43.2203068Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2203186Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2203260Z 2025-05-07T20:32:43.2203396Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2203497Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2203609Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2203731Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2203870Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2203947Z 2025-05-07T20:32:43.2204046Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.2204056Z 2025-05-07T20:32:43.2204157Z moe/activation_test.py:126: 2025-05-07T20:32:43.2204284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2204388Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2204525Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2205074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2205172Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2205604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2205832Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2206205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2206468Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2206866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.2207120Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2207488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2207659Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2207998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2208076Z fn() 2025-05-07T20:32:43.2208472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2208553Z self.fn.run( 2025-05-07T20:32:43.2208925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2209058Z kernel = self.compile( 2025-05-07T20:32:43.2209430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2209609Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2209734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2209739Z 2025-05-07T20:32:43.2209945Z self = 2025-05-07T20:32:43.2210718Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2211279Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fbfda1b9430>} 2025-05-07T20:32:43.2212033Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2212224Z context = 2025-05-07T20:32:43.2212229Z 2025-05-07T20:32:43.2212393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2212659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2212772Z module_map=module_map) 2025-05-07T20:32:43.2212937Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2213041Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2213118Z E ^ 2025-05-07T20:32:43.2213472Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2213479Z 2025-05-07T20:32:43.2213891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2213895Z 2025-05-07T20:32:43.2214000Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2214224Z self=, 2025-05-07T20:32:43.2214298Z T=1, 2025-05-07T20:32:43.2214376Z D=5120, 2025-05-07T20:32:43.2214456Z scale_ub=None, 2025-05-07T20:32:43.2214541Z contiguous=True, 2025-05-07T20:32:43.2214627Z compiled=False, 2025-05-07T20:32:43.2214743Z ) 2025-05-07T20:32:43.2214960Z self = 2025-05-07T20:32:43.2215130Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2215134Z 2025-05-07T20:32:43.2215210Z @given( 2025-05-07T20:32:43.2215339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2215435Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2215553Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2215677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2215792Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2215866Z ) 2025-05-07T20:32:43.2216118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2216212Z def test_silu_mul_quant( 2025-05-07T20:32:43.2216291Z self, 2025-05-07T20:32:43.2216371Z T: int, 2025-05-07T20:32:43.2216448Z D: int, 2025-05-07T20:32:43.2216547Z scale_ub: Optional[float], 2025-05-07T20:32:43.2216640Z contiguous: bool, 2025-05-07T20:32:43.2216726Z compiled: bool, 2025-05-07T20:32:43.2216806Z ) -> None: 2025-05-07T20:32:43.2216898Z torch.manual_seed(2025) 2025-05-07T20:32:43.2216969Z 2025-05-07T20:32:43.2217216Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2217292Z 2025-05-07T20:32:43.2217383Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2217510Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2217597Z x = x_sign * x_clamp 2025-05-07T20:32:43.2217676Z x0 = x[:, :D] 2025-05-07T20:32:43.2217758Z x1 = x[:, D:] 2025-05-07T20:32:43.2217831Z 2025-05-07T20:32:43.2217916Z if contiguous: 2025-05-07T20:32:43.2218010Z x0 = x0.contiguous() 2025-05-07T20:32:43.2218098Z x1 = x1.contiguous() 2025-05-07T20:32:43.2218172Z 2025-05-07T20:32:43.2218308Z if scale_ub is not None: 2025-05-07T20:32:43.2218413Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2218552Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2218628Z ) 2025-05-07T20:32:43.2218705Z else: 2025-05-07T20:32:43.2218800Z scale_ub_tensor = None 2025-05-07T20:32:43.2218880Z 2025-05-07T20:32:43.2219009Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2219102Z op = silu_mul_quant 2025-05-07T20:32:43.2219189Z if compiled: 2025-05-07T20:32:43.2219287Z op 
= torch.compile(op) 2025-05-07T20:32:43.2219399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2219470Z 2025-05-07T20:32:43.2219567Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2219572Z 2025-05-07T20:32:43.2219669Z moe/activation_test.py:117: 2025-05-07T20:32:43.2219795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2219907Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2220007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2220509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2220615Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2220974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2221250Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2221585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2221678Z kernel = self.compile( 2025-05-07T20:32:43.2222064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2222281Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2222409Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2222417Z 2025-05-07T20:32:43.2222626Z self = 2025-05-07T20:32:43.2223395Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2223915Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfda1b9e50>} 2025-05-07T20:32:43.2224663Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2224862Z context = 2025-05-07T20:32:43.2224867Z 2025-05-07T20:32:43.2225034Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2225300Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2225488Z module_map=module_map) 2025-05-07T20:32:43.2225651Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2225749Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2225829Z E ^ 2025-05-07T20:32:43.2226189Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2226194Z 2025-05-07T20:32:43.2226606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2226610Z 2025-05-07T20:32:43.2226713Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2226980Z self=, 2025-05-07T20:32:43.2227059Z T=128, 2025-05-07T20:32:43.2227133Z D=5120, 2025-05-07T20:32:43.2227217Z scale_ub=None, 2025-05-07T20:32:43.2227304Z contiguous=False, 2025-05-07T20:32:43.2227386Z compiled=True, 2025-05-07T20:32:43.2227465Z ) 2025-05-07T20:32:43.2227682Z self = 2025-05-07T20:32:43.2227851Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2227855Z 2025-05-07T20:32:43.2227936Z @given( 2025-05-07T20:32:43.2228052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2228148Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2228266Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2228382Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2228498Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2228579Z ) 2025-05-07T20:32:43.2228823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2228919Z def test_silu_mul_quant( 2025-05-07T20:32:43.2228994Z self, 2025-05-07T20:32:43.2229072Z T: int, 2025-05-07T20:32:43.2229153Z D: int, 2025-05-07T20:32:43.2229253Z scale_ub: Optional[float], 2025-05-07T20:32:43.2229342Z contiguous: bool, 2025-05-07T20:32:43.2229432Z compiled: bool, 2025-05-07T20:32:43.2229509Z ) -> None: 2025-05-07T20:32:43.2229603Z torch.manual_seed(2025) 2025-05-07T20:32:43.2229678Z 2025-05-07T20:32:43.2229845Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2229918Z 2025-05-07T20:32:43.2230012Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2230136Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2230229Z x = x_sign * x_clamp 2025-05-07T20:32:43.2230354Z x0 = x[:, :D] 2025-05-07T20:32:43.2230436Z x1 = x[:, D:] 2025-05-07T20:32:43.2230512Z 2025-05-07T20:32:43.2230595Z if contiguous: 2025-05-07T20:32:43.2230687Z x0 = x0.contiguous() 2025-05-07T20:32:43.2230782Z x1 = x1.contiguous() 2025-05-07T20:32:43.2230853Z 2025-05-07T20:32:43.2230948Z if scale_ub is not None: 2025-05-07T20:32:43.2231058Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2231192Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2231268Z ) 2025-05-07T20:32:43.2231350Z else: 2025-05-07T20:32:43.2231442Z scale_ub_tensor = None 2025-05-07T20:32:43.2231516Z 2025-05-07T20:32:43.2231643Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2231733Z op = silu_mul_quant 2025-05-07T20:32:43.2231823Z if compiled: 2025-05-07T20:32:43.2231922Z op = torch.compile(op) 2025-05-07T20:32:43.2232037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2232114Z 2025-05-07T20:32:43.2232205Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2232209Z 2025-05-07T20:32:43.2232305Z moe/activation_test.py:117: 2025-05-07T20:32:43.2232436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2232613Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2232717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2233083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2233174Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2233667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2233763Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2234130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2234422Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2234763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2234856Z kernel = self.compile( 2025-05-07T20:32:43.2235237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2235415Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2235544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2235549Z 2025-05-07T20:32:43.2235757Z self = 2025-05-07T20:32:43.2236531Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2237045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd98b8040>} 2025-05-07T20:32:43.2237794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2237995Z context = 2025-05-07T20:32:43.2237999Z 2025-05-07T20:32:43.2238162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2238424Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2238531Z module_map=module_map) 2025-05-07T20:32:43.2238690Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2238831Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2238908Z E ^ 2025-05-07T20:32:43.2239256Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2239263Z 2025-05-07T20:32:43.2239680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2239687Z 2025-05-07T20:32:43.2239788Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2240015Z self=, 2025-05-07T20:32:43.2240320Z T=128, 2025-05-07T20:32:43.2240434Z D=7168, 2025-05-07T20:32:43.2240543Z scale_ub=1200.0, 2025-05-07T20:32:43.2240631Z contiguous=False, 2025-05-07T20:32:43.2240714Z compiled=False, 2025-05-07T20:32:43.2240791Z ) 2025-05-07T20:32:43.2241007Z self = 2025-05-07T20:32:43.2241190Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2241195Z 2025-05-07T20:32:43.2241272Z @given( 2025-05-07T20:32:43.2241388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2241490Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2241742Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2241858Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2241976Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2242055Z ) 2025-05-07T20:32:43.2242307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2242401Z def test_silu_mul_quant( 2025-05-07T20:32:43.2242476Z self, 2025-05-07T20:32:43.2242557Z T: int, 2025-05-07T20:32:43.2242633Z D: int, 2025-05-07T20:32:43.2242730Z scale_ub: Optional[float], 2025-05-07T20:32:43.2242822Z contiguous: bool, 2025-05-07T20:32:43.2242969Z compiled: bool, 2025-05-07T20:32:43.2243047Z ) -> None: 2025-05-07T20:32:43.2243146Z torch.manual_seed(2025) 2025-05-07T20:32:43.2243219Z 2025-05-07T20:32:43.2243387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2243465Z 2025-05-07T20:32:43.2243562Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2243687Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2243781Z x = x_sign * x_clamp 2025-05-07T20:32:43.2243862Z x0 = x[:, :D] 2025-05-07T20:32:43.2243949Z x1 = x[:, D:] 2025-05-07T20:32:43.2244022Z 2025-05-07T20:32:43.2244105Z if contiguous: 2025-05-07T20:32:43.2244199Z x0 = x0.contiguous() 2025-05-07T20:32:43.2244290Z x1 = x1.contiguous() 2025-05-07T20:32:43.2244363Z 2025-05-07T20:32:43.2244457Z if scale_ub is not None: 2025-05-07T20:32:43.2244561Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2244700Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2244783Z ) 2025-05-07T20:32:43.2244859Z else: 2025-05-07T20:32:43.2244952Z scale_ub_tensor = None 2025-05-07T20:32:43.2245031Z 2025-05-07T20:32:43.2245159Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2245257Z op = silu_mul_quant 2025-05-07T20:32:43.2245342Z if compiled: 2025-05-07T20:32:43.2245443Z op = torch.compile(op) 2025-05-07T20:32:43.2245550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2245624Z 2025-05-07T20:32:43.2245714Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2245719Z 2025-05-07T20:32:43.2245820Z moe/activation_test.py:117: 2025-05-07T20:32:43.2245947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2246049Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2246154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2246724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2246826Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2247183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2247415Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2247756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2247849Z kernel = self.compile( 2025-05-07T20:32:43.2248228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2248406Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2248531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2248544Z 2025-05-07T20:32:43.2248758Z self = 2025-05-07T20:32:43.2249560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2250110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd98b8c10>} 2025-05-07T20:32:43.2250879Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2251093Z context = 2025-05-07T20:32:43.2251098Z 2025-05-07T20:32:43.2251270Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2251577Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2251690Z module_map=module_map) 2025-05-07T20:32:43.2251852Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2251958Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2252037Z E ^ 2025-05-07T20:32:43.2252387Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2252391Z 2025-05-07T20:32:43.2252799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2252805Z 2025-05-07T20:32:43.2252910Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2253135Z self=, 2025-05-07T20:32:43.2253216Z T=128, 2025-05-07T20:32:43.2253296Z D=5120, 2025-05-07T20:32:43.2253377Z scale_ub=None, 2025-05-07T20:32:43.2253466Z contiguous=False, 2025-05-07T20:32:43.2253551Z compiled=False, 2025-05-07T20:32:43.2253623Z ) 2025-05-07T20:32:43.2253841Z self = 2025-05-07T20:32:43.2254022Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2254027Z 2025-05-07T20:32:43.2254107Z @given( 2025-05-07T20:32:43.2254226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2254325Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2254443Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2254561Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2254676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2254757Z ) 2025-05-07T20:32:43.2255002Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2255144Z def test_silu_mul_quant( 2025-05-07T20:32:43.2255228Z self, 2025-05-07T20:32:43.2255305Z T: int, 2025-05-07T20:32:43.2255382Z D: int, 2025-05-07T20:32:43.2255482Z scale_ub: Optional[float], 2025-05-07T20:32:43.2255571Z contiguous: bool, 2025-05-07T20:32:43.2255667Z compiled: bool, 2025-05-07T20:32:43.2255746Z ) -> None: 2025-05-07T20:32:43.2255842Z torch.manual_seed(2025) 2025-05-07T20:32:43.2255920Z 2025-05-07T20:32:43.2256088Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2256163Z 2025-05-07T20:32:43.2256259Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2256385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2256473Z x = x_sign * x_clamp 2025-05-07T20:32:43.2256558Z x0 = x[:, :D] 2025-05-07T20:32:43.2256637Z x1 = x[:, D:] 2025-05-07T20:32:43.2256709Z 2025-05-07T20:32:43.2256797Z if contiguous: 2025-05-07T20:32:43.2256892Z x0 = x0.contiguous() 2025-05-07T20:32:43.2256983Z x1 = x1.contiguous() 2025-05-07T20:32:43.2257057Z 2025-05-07T20:32:43.2257148Z if scale_ub is not None: 2025-05-07T20:32:43.2257254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2257469Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2257545Z ) 2025-05-07T20:32:43.2257624Z else: 2025-05-07T20:32:43.2257718Z scale_ub_tensor = None 2025-05-07T20:32:43.2257790Z 2025-05-07T20:32:43.2257922Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2258012Z op = silu_mul_quant 2025-05-07T20:32:43.2258097Z if compiled: 2025-05-07T20:32:43.2258203Z op = torch.compile(op) 2025-05-07T20:32:43.2258307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2258379Z 2025-05-07T20:32:43.2258474Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2258521Z 2025-05-07T20:32:43.2258619Z moe/activation_test.py:117: 2025-05-07T20:32:43.2258747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2258849Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2258948Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2259451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2259548Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2259909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2260133Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2260469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2260568Z kernel = self.compile( 2025-05-07T20:32:43.2260958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2261229Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2261358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2261365Z 2025-05-07T20:32:43.2261575Z self = 2025-05-07T20:32:43.2262342Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2262857Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9bf0310>} 2025-05-07T20:32:43.2263642Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2263842Z context = 2025-05-07T20:32:43.2263847Z 2025-05-07T20:32:43.2264015Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2264289Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2264396Z module_map=module_map) 2025-05-07T20:32:43.2264558Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2264666Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2264743Z E ^ 2025-05-07T20:32:43.2265095Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... identical test body and traceback omitted; fails in fn() at moe/activation_test.py:117 with the same CompilationError in _fbgemm_silu_mul_quant ...]
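Every failure in this run reduces to the same root cause: the kernel asks Triton to emit fp8e4nv (the dtype backing torch.float8_e4m3fn), and Triton only supports that conversion on newer NVIDIA architectures. On the GPU driving this runner it offers only fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal skip-guard sketch follows; the helper name and the (8, 9) capability cutoff are assumptions (the exact threshold depends on the Triton version in use), not FBGEMM API:

    # Sketch: skip fp8e4nv-dependent tests on GPUs that predate Triton's
    # fp8e4nv support. `requires_fp8e4nv` and the (8, 9) cutoff are
    # illustrative assumptions, not part of FBGEMM.
    import pytest
    import torch

    def _supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # fp8e4nv backs torch.float8_e4m3fn; Triton rejects it on older
        # parts and offers only fp8e4b15 / fp8e5 there.
        return torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8e4nv = pytest.mark.skipif(
        not _supports_fp8e4nv(),
        reason="Triton fp8e4nv (torch.float8_e4m3fn) is unsupported on this GPU",
    )

Applied as @requires_fp8e4nv on test_silu_mul_quant, this would turn the dozens of failures below into clean skips on this runner class.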
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... identical output omitted; with compiled=True the traceback additionally passes through torch/_dynamo/eval_frame.py:678 before reaching activation.py:80, then fails with the same CompilationError ...]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... identical output omitted; same CompilationError in _fbgemm_silu_mul_quant ...]
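The full test body is reprinted for every drawn example because the suite runs with @settings(verbosity=Verbosity.verbose, ...). If log volume matters, normal verbosity still reports the falsifying examples without echoing the test source per draw. A sketch of the same decorator shape, with a literal standing in for the suite's _MAX_SAMPLES:

    # Sketch: same shape as the suite's decorators, with normal verbosity
    # so each drawn example is not accompanied by a dump of the test body.
    from hypothesis import Verbosity, given, settings, strategies as st

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(verbosity=Verbosity.normal, max_examples=10, deadline=None)
    def test_example(T: int) -> None:
        assert T > 0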
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... test body omitted; this example gets past the kernel under test, y_fp8, y_scale = fn() succeeds, and fails in the reference path instead ...]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[... autotuner benchmarking chain (triton/runtime/autotuner.py:186 -> 166 -> triton/testing.py:117 -> autotuner.py:152 -> jit.py:623 -> compiler.py:273) and make_ir frames omitted ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[... identical output omitted; same CompilationError in _fbgemm_silu_mul_quant ...]
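The reference path above spells out the op's semantics: y = x0 * sigmoid(x0) * x1 (a SiLU-gated product), then rowwise fp8 quantization via triton_quantize_fp8_row. A rough pure-PyTorch equivalent of that rowwise step is sketched below. It is inferred from how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]), not taken from FBGEMM's kernel, so the epsilon and the exact scale_ub semantics are assumptions:

    # Sketch: rowwise fp8 quantization as the test implies it. Scale is
    # row max-abs (optionally clamped to scale_ub) divided by the fp8 max;
    # dequantization is y_fp8.float() * scale[:, None], matching the test.
    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / fp8_max  # avoid div-by-zero rows
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale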
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2338220Z 2025-05-07T20:32:43.2338633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2338639Z 2025-05-07T20:32:43.2338739Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2338964Z self=, 2025-05-07T20:32:43.2339040Z T=1, 2025-05-07T20:32:43.2339121Z D=5120, 2025-05-07T20:32:43.2339206Z scale_ub=1200.0, 2025-05-07T20:32:43.2339291Z contiguous=False, 2025-05-07T20:32:43.2339378Z compiled=False, 2025-05-07T20:32:43.2339451Z ) 2025-05-07T20:32:43.2339665Z self = 2025-05-07T20:32:43.2339838Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2339842Z 2025-05-07T20:32:43.2339918Z @given( 2025-05-07T20:32:43.2340036Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2340369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2340614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2340741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2340858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2340933Z ) 2025-05-07T20:32:43.2341229Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2341329Z def test_silu_mul_quant( 2025-05-07T20:32:43.2341404Z self, 2025-05-07T20:32:43.2341483Z T: int, 2025-05-07T20:32:43.2341558Z D: int, 2025-05-07T20:32:43.2341659Z scale_ub: Optional[float], 2025-05-07T20:32:43.2341749Z contiguous: bool, 2025-05-07T20:32:43.2341833Z compiled: bool, 2025-05-07T20:32:43.2341910Z ) -> None: 2025-05-07T20:32:43.2342006Z torch.manual_seed(2025) 2025-05-07T20:32:43.2342077Z 2025-05-07T20:32:43.2342245Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2342321Z 2025-05-07T20:32:43.2342414Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2342542Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2342630Z x = x_sign * x_clamp 2025-05-07T20:32:43.2342709Z x0 = x[:, :D] 2025-05-07T20:32:43.2342793Z x1 = x[:, D:] 2025-05-07T20:32:43.2342864Z 2025-05-07T20:32:43.2343011Z if contiguous: 2025-05-07T20:32:43.2343153Z x0 = x0.contiguous() 2025-05-07T20:32:43.2343244Z x1 = x1.contiguous() 2025-05-07T20:32:43.2343316Z 2025-05-07T20:32:43.2343410Z if scale_ub is not None: 2025-05-07T20:32:43.2343515Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2343652Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2343730Z ) 2025-05-07T20:32:43.2343806Z else: 2025-05-07T20:32:43.2343902Z scale_ub_tensor = None 2025-05-07T20:32:43.2343974Z 2025-05-07T20:32:43.2344103Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2344266Z op = silu_mul_quant 2025-05-07T20:32:43.2344352Z if compiled: 2025-05-07T20:32:43.2344451Z op = torch.compile(op) 2025-05-07T20:32:43.2344559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2344632Z 2025-05-07T20:32:43.2344721Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2344730Z 2025-05-07T20:32:43.2344830Z moe/activation_test.py:117: 2025-05-07T20:32:43.2344957Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2345057Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2345162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2345656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2345754Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2346109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2346340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2346682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2346775Z kernel = self.compile( 2025-05-07T20:32:43.2347167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2347344Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2347469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2347473Z 2025-05-07T20:32:43.2347684Z self = 2025-05-07T20:32:43.2348509Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2349028Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9a30550>} 2025-05-07T20:32:43.2349787Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2349984Z context = 2025-05-07T20:32:43.2349989Z 2025-05-07T20:32:43.2350158Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2350420Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2350532Z module_map=module_map) 2025-05-07T20:32:43.2350693Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2350796Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2350879Z E ^ 2025-05-07T20:32:43.2351238Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2351243Z 2025-05-07T20:32:43.2351703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2351742Z 2025-05-07T20:32:43.2351845Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2352066Z self=, 2025-05-07T20:32:43.2352145Z T=16384, 2025-05-07T20:32:43.2352222Z D=5120, 2025-05-07T20:32:43.2352305Z scale_ub=1200.0, 2025-05-07T20:32:43.2352396Z contiguous=False, 2025-05-07T20:32:43.2352482Z compiled=True, 2025-05-07T20:32:43.2352554Z ) 2025-05-07T20:32:43.2352775Z self = 2025-05-07T20:32:43.2352999Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2353004Z 2025-05-07T20:32:43.2353083Z @given( 2025-05-07T20:32:43.2353200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2353298Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2353423Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2353538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2353652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2353730Z ) 2025-05-07T20:32:43.2353979Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2354072Z def test_silu_mul_quant( 2025-05-07T20:32:43.2354150Z self, 2025-05-07T20:32:43.2354226Z T: int, 2025-05-07T20:32:43.2354305Z D: int, 2025-05-07T20:32:43.2354403Z scale_ub: Optional[float], 2025-05-07T20:32:43.2354491Z contiguous: bool, 2025-05-07T20:32:43.2354585Z compiled: bool, 2025-05-07T20:32:43.2354664Z ) -> None: 2025-05-07T20:32:43.2354759Z torch.manual_seed(2025) 2025-05-07T20:32:43.2354836Z 2025-05-07T20:32:43.2355007Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2355080Z 2025-05-07T20:32:43.2355182Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2355306Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2355396Z x = x_sign * x_clamp 2025-05-07T20:32:43.2355480Z x0 = x[:, :D] 2025-05-07T20:32:43.2355560Z x1 = x[:, D:] 2025-05-07T20:32:43.2355633Z 2025-05-07T20:32:43.2355718Z if contiguous: 2025-05-07T20:32:43.2355808Z x0 = x0.contiguous() 2025-05-07T20:32:43.2355901Z x1 = x1.contiguous() 2025-05-07T20:32:43.2355973Z 2025-05-07T20:32:43.2356063Z if scale_ub is not None: 2025-05-07T20:32:43.2356172Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2356357Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2356435Z ) 2025-05-07T20:32:43.2356514Z else: 2025-05-07T20:32:43.2356607Z scale_ub_tensor = None 2025-05-07T20:32:43.2356680Z 2025-05-07T20:32:43.2356815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2356911Z op = silu_mul_quant 2025-05-07T20:32:43.2356998Z if compiled: 2025-05-07T20:32:43.2357103Z op = torch.compile(op) 2025-05-07T20:32:43.2357208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2357282Z 2025-05-07T20:32:43.2357373Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2357377Z 2025-05-07T20:32:43.2357473Z moe/activation_test.py:117: 2025-05-07T20:32:43.2357601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2357701Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2357800Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2358173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2358266Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2358759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2358939Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2359302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2359527Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2359864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2359958Z kernel = self.compile( 2025-05-07T20:32:43.2360343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2360583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2360713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2360718Z 2025-05-07T20:32:43.2360923Z self = 2025-05-07T20:32:43.2361710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2362230Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9ae41f0>} 2025-05-07T20:32:43.2362968Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2363164Z context = 2025-05-07T20:32:43.2363169Z 2025-05-07T20:32:43.2363334Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2363601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2363711Z module_map=module_map) 2025-05-07T20:32:43.2363873Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2363975Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2364052Z E ^ 2025-05-07T20:32:43.2364407Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2364412Z 2025-05-07T20:32:43.2364824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2364828Z 2025-05-07T20:32:43.2364976Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2365202Z self=, 2025-05-07T20:32:43.2365279Z T=2048, 2025-05-07T20:32:43.2365354Z D=7168, 2025-05-07T20:32:43.2365441Z scale_ub=1200.0, 2025-05-07T20:32:43.2365526Z contiguous=False, 2025-05-07T20:32:43.2365613Z compiled=True, 2025-05-07T20:32:43.2365689Z ) 2025-05-07T20:32:43.2365909Z self = 2025-05-07T20:32:43.2366081Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2366085Z 2025-05-07T20:32:43.2366164Z @given( 2025-05-07T20:32:43.2366281Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2366382Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2366496Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2366613Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2366734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2366808Z ) 2025-05-07T20:32:43.2367057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2367155Z def test_silu_mul_quant( 2025-05-07T20:32:43.2367232Z self, 2025-05-07T20:32:43.2367350Z T: int, 2025-05-07T20:32:43.2367465Z D: int, 2025-05-07T20:32:43.2367564Z scale_ub: Optional[float], 2025-05-07T20:32:43.2367653Z contiguous: bool, 2025-05-07T20:32:43.2367743Z compiled: bool, 2025-05-07T20:32:43.2367820Z ) -> None: 2025-05-07T20:32:43.2367919Z torch.manual_seed(2025) 2025-05-07T20:32:43.2367991Z 2025-05-07T20:32:43.2368160Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2368238Z 2025-05-07T20:32:43.2368329Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2368455Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2368592Z x = x_sign * x_clamp 2025-05-07T20:32:43.2368672Z x0 = x[:, :D] 2025-05-07T20:32:43.2368752Z x1 = x[:, D:] 2025-05-07T20:32:43.2368826Z 2025-05-07T20:32:43.2368910Z if contiguous: 2025-05-07T20:32:43.2369001Z x0 = x0.contiguous() 2025-05-07T20:32:43.2369092Z x1 = x1.contiguous() 2025-05-07T20:32:43.2369169Z 2025-05-07T20:32:43.2369266Z if scale_ub is not None: 2025-05-07T20:32:43.2369373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2369509Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2369590Z ) 2025-05-07T20:32:43.2369667Z else: 2025-05-07T20:32:43.2369760Z scale_ub_tensor = None 2025-05-07T20:32:43.2369834Z 2025-05-07T20:32:43.2369964Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2370056Z op = silu_mul_quant 2025-05-07T20:32:43.2370144Z if compiled: 2025-05-07T20:32:43.2370248Z op = torch.compile(op) 2025-05-07T20:32:43.2370356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2370432Z 2025-05-07T20:32:43.2370523Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2370527Z 2025-05-07T20:32:43.2370627Z moe/activation_test.py:117: 2025-05-07T20:32:43.2370756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2370862Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2370965Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2371335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2371426Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2371920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2372017Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2372417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2372649Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2372989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2373092Z kernel = self.compile( 2025-05-07T20:32:43.2373474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2373651Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2373778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2373783Z 2025-05-07T20:32:43.2373991Z self = 2025-05-07T20:32:43.2374764Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2375280Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd9ae4ee0>} 2025-05-07T20:32:43.2376110Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2376306Z context = 2025-05-07T20:32:43.2376310Z 2025-05-07T20:32:43.2376477Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2376741Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2376847Z module_map=module_map) 2025-05-07T20:32:43.2377053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2377154Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2377231Z E ^ 2025-05-07T20:32:43.2377588Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2377596Z 2025-05-07T20:32:43.2378013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2378017Z 2025-05-07T20:32:43.2378119Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2378347Z self=, 2025-05-07T20:32:43.2378422Z T=1, 2025-05-07T20:32:43.2378500Z D=5120, 2025-05-07T20:32:43.2378581Z scale_ub=None, 2025-05-07T20:32:43.2378668Z contiguous=False, 2025-05-07T20:32:43.2378754Z compiled=False, 2025-05-07T20:32:43.2378828Z ) 2025-05-07T20:32:43.2379045Z self = 2025-05-07T20:32:43.2379219Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2379224Z 2025-05-07T20:32:43.2379301Z @given( 2025-05-07T20:32:43.2379420Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2379523Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2379642Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2379761Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2379875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2379951Z ) 2025-05-07T20:32:43.2380198Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2380293Z def test_silu_mul_quant( 2025-05-07T20:32:43.2380368Z self, 2025-05-07T20:32:43.2380448Z T: int, 2025-05-07T20:32:43.2380525Z D: int, 2025-05-07T20:32:43.2380625Z scale_ub: Optional[float], 2025-05-07T20:32:43.2380763Z contiguous: bool, 2025-05-07T20:32:43.2380872Z compiled: bool, 2025-05-07T20:32:43.2380954Z ) -> None: 2025-05-07T20:32:43.2381123Z torch.manual_seed(2025) 2025-05-07T20:32:43.2381197Z 2025-05-07T20:32:43.2381370Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2381448Z 2025-05-07T20:32:43.2381541Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2381668Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2381757Z x = x_sign * x_clamp 2025-05-07T20:32:43.2381836Z x0 = x[:, :D] 2025-05-07T20:32:43.2381919Z x1 = x[:, D:] 2025-05-07T20:32:43.2381991Z 2025-05-07T20:32:43.2382073Z if contiguous: 2025-05-07T20:32:43.2382167Z x0 = x0.contiguous() 2025-05-07T20:32:43.2382257Z x1 = x1.contiguous() 2025-05-07T20:32:43.2382328Z 2025-05-07T20:32:43.2382421Z if scale_ub is not None: 2025-05-07T20:32:43.2382531Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2382672Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2382747Z ) 2025-05-07T20:32:43.2382825Z else: 2025-05-07T20:32:43.2382922Z scale_ub_tensor = None 2025-05-07T20:32:43.2382995Z 2025-05-07T20:32:43.2383168Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2383297Z op = silu_mul_quant 2025-05-07T20:32:43.2383382Z if compiled: 2025-05-07T20:32:43.2383481Z op = torch.compile(op) 2025-05-07T20:32:43.2383587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2383659Z 2025-05-07T20:32:43.2383749Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2383753Z 2025-05-07T20:32:43.2383854Z moe/activation_test.py:117: 2025-05-07T20:32:43.2383979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2384081Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2384223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2384722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2384822Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2385181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2385411Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2385749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2385842Z kernel = self.compile( 2025-05-07T20:32:43.2386222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2386396Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2386523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2386530Z 2025-05-07T20:32:43.2386737Z self = 2025-05-07T20:32:43.2387502Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2388020Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd98475e0>} 2025-05-07T20:32:43.2388757Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2388952Z context = 2025-05-07T20:32:43.2388960Z 2025-05-07T20:32:43.2389169Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2389439Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2389548Z module_map=module_map) 2025-05-07T20:32:43.2389715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2389813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2389892Z E ^ 2025-05-07T20:32:43.2390242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)

T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
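The root cause here is the GPU architecture, not the test parameters: fp8e4nv is Triton's name for the FP8 E4M3 format, which Triton only emits for NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); on an sm_86-class device such as the A10G only fp8e4b15 and fp8e5 are available, so the kernel dies during ast_to_ttir before any GPU code runs. A minimal sketch of a capability gate that could skip FP8 E4M3 paths on such hardware follows; the helper and class names are illustrative assumptions, not part of activation_test.py:

    # Illustrative sketch only; names below are hypothetical, not FBGEMM code.
    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        """True when Triton can emit fp8e4nv: NVIDIA sm_89+ (Ada/Hopper)."""
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        # An A10G reports (8, 6) and fails this check, which matches the
        # CompilationError seen throughout this job.
        return (major, minor) >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class Fp8ActivationTests(unittest.TestCase):
        def test_silu_mul_quant_smoke(self) -> None:
            ...  # an FP8 E4M3 test body would go here

With a gate like this the job would report a skip on sm_86 runners instead of one CompilationError per Hypothesis example.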
Hypothesis then tried ten further examples; each one ran the identical test body and failed with the identical traceback into triton's compiler, ending in the same CompilationError / ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). For compiled=True the call additionally passed through /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching the kernel launch; the outcome was unchanged. The examples tried:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
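That the ten shape and flag combinations above changed nothing is expected: the error is raised while lowering the kernel's AST, before any tensor shapes matter. Under the assumption stated earlier (sm_86 hardware), any Triton kernel that materializes an fp8e4nv value should reproduce it; a hypothetical minimal example, not taken from FBGEMM, would look like this:

    # Hypothetical minimal reproduction; not FBGEMM code.
    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On pre-sm_89 GPUs the cast below is what raises
        # ValueError("type fp8e4nv not supported in this architecture ...").
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    n = 1024
    x = torch.randn(n, device="cuda", dtype=torch.float32)
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    _to_fp8e4nv[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)

On sm_89 or newer the same kernel compiles and stores valid E4M3 bytes. The last example Hypothesis tried in this block follows.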
[Hypothesis then retried the failing test with eleven further parameter combinations; each produced the identical test listing, traceback, and CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"):]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2691126Z 2025-05-07T20:32:43.2691538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2691543Z 2025-05-07T20:32:43.2691645Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2691877Z self=, 2025-05-07T20:32:43.2691955Z T=128, 2025-05-07T20:32:43.2692031Z D=7168, 2025-05-07T20:32:43.2692119Z scale_ub=1200.0, 2025-05-07T20:32:43.2692205Z contiguous=False, 2025-05-07T20:32:43.2692287Z compiled=True, 2025-05-07T20:32:43.2692365Z ) 2025-05-07T20:32:43.2692588Z self = 2025-05-07T20:32:43.2692764Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2692768Z 2025-05-07T20:32:43.2692846Z @given( 2025-05-07T20:32:43.2692969Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2693073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2693186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2693304Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2693423Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2693498Z ) 2025-05-07T20:32:43.2693796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2693893Z def test_silu_mul_quant( 2025-05-07T20:32:43.2693971Z self, 2025-05-07T20:32:43.2694055Z T: int, 2025-05-07T20:32:43.2694130Z D: int, 2025-05-07T20:32:43.2694232Z scale_ub: Optional[float], 2025-05-07T20:32:43.2694328Z contiguous: bool, 2025-05-07T20:32:43.2694413Z compiled: bool, 2025-05-07T20:32:43.2694491Z ) -> None: 2025-05-07T20:32:43.2694587Z torch.manual_seed(2025) 2025-05-07T20:32:43.2694660Z 2025-05-07T20:32:43.2694828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2694905Z 2025-05-07T20:32:43.2694996Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2695122Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2695214Z x = x_sign * x_clamp 2025-05-07T20:32:43.2695295Z x0 = x[:, :D] 2025-05-07T20:32:43.2695388Z x1 = x[:, D:] 2025-05-07T20:32:43.2695463Z 2025-05-07T20:32:43.2695546Z if contiguous: 2025-05-07T20:32:43.2695641Z x0 = x0.contiguous() 2025-05-07T20:32:43.2695730Z x1 = x1.contiguous() 2025-05-07T20:32:43.2695803Z 2025-05-07T20:32:43.2695898Z if scale_ub is not None: 2025-05-07T20:32:43.2696088Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2696228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2696307Z ) 2025-05-07T20:32:43.2696385Z else: 2025-05-07T20:32:43.2696478Z scale_ub_tensor = None 2025-05-07T20:32:43.2696558Z 2025-05-07T20:32:43.2696692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2696781Z op = silu_mul_quant 2025-05-07T20:32:43.2696872Z if compiled: 2025-05-07T20:32:43.2696971Z op = torch.compile(op) 2025-05-07T20:32:43.2697080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2697203Z 2025-05-07T20:32:43.2697294Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2697298Z 2025-05-07T20:32:43.2697403Z moe/activation_test.py:117: 2025-05-07T20:32:43.2697531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2697633Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2697739Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2698104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2698201Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2698694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2698792Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2699157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2699392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2699728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2699824Z kernel = self.compile( 2025-05-07T20:32:43.2700216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2700395Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2700528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2700532Z 2025-05-07T20:32:43.2700740Z self = 2025-05-07T20:32:43.2701576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2702142Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8e9a940>} 2025-05-07T20:32:43.2702888Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2703092Z context = 2025-05-07T20:32:43.2703097Z 2025-05-07T20:32:43.2703266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2703538Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2703644Z module_map=module_map) 2025-05-07T20:32:43.2703805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2703909Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2703991Z E ^ 2025-05-07T20:32:43.2704355Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2704360Z 2025-05-07T20:32:43.2704810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2704849Z 2025-05-07T20:32:43.2704954Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2705178Z self=, 2025-05-07T20:32:43.2705254Z T=2048, 2025-05-07T20:32:43.2705331Z D=7168, 2025-05-07T20:32:43.2705418Z scale_ub=None, 2025-05-07T20:32:43.2705504Z contiguous=True, 2025-05-07T20:32:43.2705588Z compiled=True, 2025-05-07T20:32:43.2705664Z ) 2025-05-07T20:32:43.2705883Z self = 2025-05-07T20:32:43.2706062Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2706106Z 2025-05-07T20:32:43.2706183Z @given( 2025-05-07T20:32:43.2706301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2706406Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2706521Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2706644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2706768Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2706841Z ) 2025-05-07T20:32:43.2707094Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2707187Z def test_silu_mul_quant( 2025-05-07T20:32:43.2707262Z self, 2025-05-07T20:32:43.2707344Z T: int, 2025-05-07T20:32:43.2707421Z D: int, 2025-05-07T20:32:43.2707519Z scale_ub: Optional[float], 2025-05-07T20:32:43.2707610Z contiguous: bool, 2025-05-07T20:32:43.2707695Z compiled: bool, 2025-05-07T20:32:43.2707778Z ) -> None: 2025-05-07T20:32:43.2707876Z torch.manual_seed(2025) 2025-05-07T20:32:43.2707952Z 2025-05-07T20:32:43.2708121Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2708200Z 2025-05-07T20:32:43.2708292Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2708430Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2708520Z x = x_sign * x_clamp 2025-05-07T20:32:43.2708601Z x0 = x[:, :D] 2025-05-07T20:32:43.2708685Z x1 = x[:, D:] 2025-05-07T20:32:43.2708761Z 2025-05-07T20:32:43.2708844Z if contiguous: 2025-05-07T20:32:43.2708940Z x0 = x0.contiguous() 2025-05-07T20:32:43.2709028Z x1 = x1.contiguous() 2025-05-07T20:32:43.2709101Z 2025-05-07T20:32:43.2709196Z if scale_ub is not None: 2025-05-07T20:32:43.2709301Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2709435Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2709586Z ) 2025-05-07T20:32:43.2709665Z else: 2025-05-07T20:32:43.2709762Z scale_ub_tensor = None 2025-05-07T20:32:43.2709836Z 2025-05-07T20:32:43.2709965Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2710060Z op = silu_mul_quant 2025-05-07T20:32:43.2710152Z if compiled: 2025-05-07T20:32:43.2710252Z op = torch.compile(op) 2025-05-07T20:32:43.2710363Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2710436Z 2025-05-07T20:32:43.2710528Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2710532Z 2025-05-07T20:32:43.2710635Z moe/activation_test.py:117: 2025-05-07T20:32:43.2710764Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2710865Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2710971Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2711338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2711438Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2711927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2712100Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2712470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2712697Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2713035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2713128Z kernel = self.compile( 2025-05-07T20:32:43.2713504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2713683Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2713845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2713849Z 2025-05-07T20:32:43.2714058Z self = 2025-05-07T20:32:43.2714828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2715336Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8c4f550>} 2025-05-07T20:32:43.2716074Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2716270Z context = 2025-05-07T20:32:43.2716275Z 2025-05-07T20:32:43.2716445Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2716707Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2716821Z module_map=module_map) 2025-05-07T20:32:43.2716990Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2717089Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2717167Z E ^ 2025-05-07T20:32:43.2717520Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2717525Z 2025-05-07T20:32:43.2717933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2717937Z 2025-05-07T20:32:43.2718042Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2718304Z self=, 2025-05-07T20:32:43.2718383Z T=16384, 2025-05-07T20:32:43.2718463Z D=5120, 2025-05-07T20:32:43.2718544Z scale_ub=None, 2025-05-07T20:32:43.2718631Z contiguous=False, 2025-05-07T20:32:43.2718721Z compiled=False, 2025-05-07T20:32:43.2718802Z ) 2025-05-07T20:32:43.2719020Z self = 2025-05-07T20:32:43.2719198Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2719202Z 2025-05-07T20:32:43.2719280Z @given( 2025-05-07T20:32:43.2719400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2719502Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2719615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2719737Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2719849Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2719929Z ) 2025-05-07T20:32:43.2720176Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2720269Z def test_silu_mul_quant( 2025-05-07T20:32:43.2720349Z self, 2025-05-07T20:32:43.2720425Z T: int, 2025-05-07T20:32:43.2720546Z D: int, 2025-05-07T20:32:43.2720684Z scale_ub: Optional[float], 2025-05-07T20:32:43.2720798Z contiguous: bool, 2025-05-07T20:32:43.2720890Z compiled: bool, 2025-05-07T20:32:43.2720993Z ) -> None: 2025-05-07T20:32:43.2721087Z torch.manual_seed(2025) 2025-05-07T20:32:43.2721159Z 2025-05-07T20:32:43.2721329Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2721408Z 2025-05-07T20:32:43.2721499Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2721628Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2723417Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2723467Z 2025-05-07T20:32:43.2723591Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2723596Z 2025-05-07T20:32:43.2723701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2723927Z self=, 2025-05-07T20:32:43.2724005Z T=4096, 2025-05-07T20:32:43.2724081Z D=7168, 2025-05-07T20:32:43.2724166Z scale_ub=1200.0, 2025-05-07T20:32:43.2724250Z contiguous=True, 2025-05-07T20:32:43.2724338Z compiled=True, 2025-05-07T20:32:43.2724416Z ) 2025-05-07T20:32:43.2724632Z self = 2025-05-07T20:32:43.2724804Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2724814Z 2025-05-07T20:32:43.2724894Z @given( 2025-05-07T20:32:43.2725016Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2725120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2725233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2725348Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2725464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2725537Z ) 2025-05-07T20:32:43.2725787Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2725884Z def test_silu_mul_quant( 2025-05-07T20:32:43.2725959Z self, 2025-05-07T20:32:43.2726081Z T: int, 2025-05-07T20:32:43.2726163Z D: int, 2025-05-07T20:32:43.2726264Z scale_ub: Optional[float], 2025-05-07T20:32:43.2726356Z contiguous: bool, 2025-05-07T20:32:43.2726441Z compiled: bool, 2025-05-07T20:32:43.2726518Z ) -> None: 2025-05-07T20:32:43.2726617Z torch.manual_seed(2025) 2025-05-07T20:32:43.2726692Z 2025-05-07T20:32:43.2726861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2726939Z 2025-05-07T20:32:43.2727031Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2727156Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2728915Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2728923Z 2025-05-07T20:32:43.2729041Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2729122Z 2025-05-07T20:32:43.2729229Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2729453Z self=, 2025-05-07T20:32:43.2729535Z T=16384, 2025-05-07T20:32:43.2729611Z D=7168, 2025-05-07T20:32:43.2729691Z scale_ub=None, 2025-05-07T20:32:43.2729780Z contiguous=False, 2025-05-07T20:32:43.2729864Z compiled=False, 2025-05-07T20:32:43.2729936Z ) 2025-05-07T20:32:43.2730155Z self = 2025-05-07T20:32:43.2730330Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2730378Z 2025-05-07T20:32:43.2730456Z @given( 2025-05-07T20:32:43.2730577Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2730677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2730793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2730915Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2731029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2731108Z ) 2025-05-07T20:32:43.2731359Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2731452Z def test_silu_mul_quant( 2025-05-07T20:32:43.2731536Z self, 2025-05-07T20:32:43.2731613Z T: int, 2025-05-07T20:32:43.2731689Z D: int, 2025-05-07T20:32:43.2731789Z scale_ub: Optional[float], 2025-05-07T20:32:43.2731877Z contiguous: bool, 2025-05-07T20:32:43.2731962Z compiled: bool, 2025-05-07T20:32:43.2732042Z ) -> None: 2025-05-07T20:32:43.2732142Z torch.manual_seed(2025) 2025-05-07T20:32:43.2732219Z 2025-05-07T20:32:43.2732385Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2734149Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2734163Z 2025-05-07T20:32:43.2734281Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2734286Z 2025-05-07T20:32:43.2734387Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2734661Z self=, 2025-05-07T20:32:43.2734738Z T=2048, 2025-05-07T20:32:43.2734814Z D=7168, 2025-05-07T20:32:43.2734904Z scale_ub=1200.0, 2025-05-07T20:32:43.2734988Z contiguous=True, 2025-05-07T20:32:43.2735069Z compiled=True, 2025-05-07T20:32:43.2735147Z ) 2025-05-07T20:32:43.2735367Z self = 2025-05-07T20:32:43.2735544Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2735548Z 2025-05-07T20:32:43.2735625Z @given( 2025-05-07T20:32:43.2735742Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2735843Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2735955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2736072Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2736186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2736265Z ) 2025-05-07T20:32:43.2736514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2736610Z def test_silu_mul_quant( 2025-05-07T20:32:43.2736685Z self, 2025-05-07T20:32:43.2736764Z T: int, 2025-05-07T20:32:43.2736839Z D: int, 2025-05-07T20:32:43.2737016Z scale_ub: Optional[float], 2025-05-07T20:32:43.2737108Z contiguous: bool, 2025-05-07T20:32:43.2737193Z compiled: bool, 2025-05-07T20:32:43.2737271Z ) -> None: 2025-05-07T20:32:43.2737368Z torch.manual_seed(2025) 2025-05-07T20:32:43.2737439Z 2025-05-07T20:32:43.2737605Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2737680Z 2025-05-07T20:32:43.2737772Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2737897Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2739647Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2739716Z 2025-05-07T20:32:43.2739837Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2739846Z 2025-05-07T20:32:43.2739947Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2740469Z self=, 2025-05-07T20:32:43.2740590Z T=2048, 2025-05-07T20:32:43.2740679Z D=7168, 2025-05-07T20:32:43.2740760Z scale_ub=None, 2025-05-07T20:32:43.2740848Z contiguous=True, 2025-05-07T20:32:43.2740931Z compiled=False, 2025-05-07T20:32:43.2741009Z ) 2025-05-07T20:32:43.2741269Z self = 2025-05-07T20:32:43.2741441Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2741446Z 2025-05-07T20:32:43.2741526Z @given( 2025-05-07T20:32:43.2741651Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2741750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2741867Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2741981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2742092Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2742171Z ) 2025-05-07T20:32:43.2742415Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2742510Z def test_silu_mul_quant( 2025-05-07T20:32:43.2742591Z self, 2025-05-07T20:32:43.2742667Z T: int, 2025-05-07T20:32:43.2742838Z D: int, 2025-05-07T20:32:43.2742942Z scale_ub: Optional[float], 2025-05-07T20:32:43.2743032Z contiguous: bool, 2025-05-07T20:32:43.2743122Z compiled: bool, 2025-05-07T20:32:43.2743199Z ) -> None: 2025-05-07T20:32:43.2743294Z torch.manual_seed(2025) 2025-05-07T20:32:43.2743373Z 2025-05-07T20:32:43.2743541Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2743615Z 2025-05-07T20:32:43.2743713Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.2745491Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2745499Z 2025-05-07T20:32:43.2745619Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.2745623Z 2025-05-07T20:32:43.2745723Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2746057Z self=, 2025-05-07T20:32:43.2746140Z T=1, 2025-05-07T20:32:43.2746215Z D=7168, 2025-05-07T20:32:43.2746302Z scale_ub=1200.0, 2025-05-07T20:32:43.2746386Z contiguous=True, 2025-05-07T20:32:43.2746471Z compiled=False, 2025-05-07T20:32:43.2746547Z ) 2025-05-07T20:32:43.2746763Z self = 2025-05-07T20:32:43.2746930Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2746935Z 2025-05-07T20:32:43.2747017Z @given( 2025-05-07T20:32:43.2747136Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2747295Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2747414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2747529Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2747643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2747722Z ) 2025-05-07T20:32:43.2747972Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2748067Z def test_silu_mul_quant( 2025-05-07T20:32:43.2748144Z self, 2025-05-07T20:32:43.2748220Z T: int, 2025-05-07T20:32:43.2748299Z D: int, 2025-05-07T20:32:43.2748395Z scale_ub: Optional[float], 2025-05-07T20:32:43.2748485Z contiguous: bool, 2025-05-07T20:32:43.2748572Z compiled: bool, 2025-05-07T20:32:43.2748649Z ) -> None: 2025-05-07T20:32:43.2748744Z torch.manual_seed(2025) 2025-05-07T20:32:43.2748826Z 2025-05-07T20:32:43.2748996Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2749075Z 2025-05-07T20:32:43.2749167Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2749292Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2749388Z x = x_sign * x_clamp 2025-05-07T20:32:43.2749469Z x0 = x[:, :D] 2025-05-07T20:32:43.2749554Z x1 = x[:, D:] 2025-05-07T20:32:43.2749630Z 2025-05-07T20:32:43.2749713Z if contiguous: 2025-05-07T20:32:43.2749806Z x0 = x0.contiguous() 2025-05-07T20:32:43.2749900Z x1 = x1.contiguous() 2025-05-07T20:32:43.2749972Z 2025-05-07T20:32:43.2750062Z if scale_ub is not None: 2025-05-07T20:32:43.2750170Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2750306Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2750381Z ) 2025-05-07T20:32:43.2750463Z else: 2025-05-07T20:32:43.2750560Z scale_ub_tensor = None 2025-05-07T20:32:43.2750687Z 2025-05-07T20:32:43.2750865Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2750990Z op = silu_mul_quant 2025-05-07T20:32:43.2751118Z if compiled: 2025-05-07T20:32:43.2751256Z op = torch.compile(op) 2025-05-07T20:32:43.2751436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2751911Z 2025-05-07T20:32:43.2752170Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2752406Z 2025-05-07T20:32:43.2752557Z moe/activation_test.py:117: 2025-05-07T20:32:43.2752953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2753392Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2753760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2754835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2755931Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2756820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2757921Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2759096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2759903Z kernel = self.compile( 2025-05-07T20:32:43.2760522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2761174Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2761571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2761799Z 2025-05-07T20:32:43.2762016Z self = 2025-05-07T20:32:43.2763120Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2764552Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8d50040>} 2025-05-07T20:32:43.2765913Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2766931Z context = 2025-05-07T20:32:43.2772113Z 2025-05-07T20:32:43.2772307Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2772841Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2773315Z module_map=module_map) 2025-05-07T20:32:43.2773692Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2774049Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2774311Z E ^ 2025-05-07T20:32:43.2774771Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2775229Z 2025-05-07T20:32:43.2775654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2776171Z 2025-05-07T20:32:43.2776272Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2776684Z self=, 2025-05-07T20:32:43.2777083Z T=128, 2025-05-07T20:32:43.2777263Z D=5120, 2025-05-07T20:32:43.2777458Z scale_ub=None, 2025-05-07T20:32:43.2777670Z contiguous=True, 2025-05-07T20:32:43.2777887Z compiled=False, 2025-05-07T20:32:43.2778100Z ) 2025-05-07T20:32:43.2778483Z self = 2025-05-07T20:32:43.2778976Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2779251Z 2025-05-07T20:32:43.2779330Z @given( 2025-05-07T20:32:43.2779560Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2779884Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2780194Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2780531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2780859Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2781231Z ) 2025-05-07T20:32:43.2781583Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2782030Z def test_silu_mul_quant( 2025-05-07T20:32:43.2782269Z self, 2025-05-07T20:32:43.2782461Z T: int, 2025-05-07T20:32:43.2782659Z D: int, 2025-05-07T20:32:43.2782884Z scale_ub: Optional[float], 2025-05-07T20:32:43.2783159Z contiguous: bool, 2025-05-07T20:32:43.2783396Z compiled: bool, 2025-05-07T20:32:43.2783613Z ) -> None: 2025-05-07T20:32:43.2783827Z torch.manual_seed(2025) 2025-05-07T20:32:43.2784072Z 2025-05-07T20:32:43.2784387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2784772Z 2025-05-07T20:32:43.2784972Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2785259Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2785568Z x = x_sign * x_clamp 2025-05-07T20:32:43.2785811Z x0 = x[:, :D] 2025-05-07T20:32:43.2786032Z x1 = x[:, D:] 2025-05-07T20:32:43.2786237Z 2025-05-07T20:32:43.2786424Z if contiguous: 2025-05-07T20:32:43.2786654Z x0 = x0.contiguous() 2025-05-07T20:32:43.2786911Z x1 = x1.contiguous() 2025-05-07T20:32:43.2787154Z 2025-05-07T20:32:43.2787346Z if scale_ub is not None: 2025-05-07T20:32:43.2787662Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2788001Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2788310Z ) 2025-05-07T20:32:43.2788500Z else: 2025-05-07T20:32:43.2788708Z scale_ub_tensor = None 2025-05-07T20:32:43.2788967Z 2025-05-07T20:32:43.2789197Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2789519Z op = silu_mul_quant 2025-05-07T20:32:43.2789771Z if compiled: 2025-05-07T20:32:43.2790023Z op = torch.compile(op) 2025-05-07T20:32:43.2790321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2790600Z 2025-05-07T20:32:43.2790827Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2791012Z 2025-05-07T20:32:43.2791110Z moe/activation_test.py:117: 2025-05-07T20:32:43.2791410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2791755Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2792032Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2792717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2793408Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2793952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2794643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2795309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2795852Z kernel = self.compile( 2025-05-07T20:32:43.2796390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2797051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2797501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2797736Z 2025-05-07T20:32:43.2797950Z self = 2025-05-07T20:32:43.2799039Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2800421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8d50a60>} 2025-05-07T20:32:43.2801762Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2802793Z context = 2025-05-07T20:32:43.2803089Z 2025-05-07T20:32:43.2803265Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2803793Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2804297Z module_map=module_map) 2025-05-07T20:32:43.2804724Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2805070Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2805325Z E ^ 2025-05-07T20:32:43.2805786Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2806231Z 2025-05-07T20:32:43.2806656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2807173Z 2025-05-07T20:32:43.2807278Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2807736Z self=, 2025-05-07T20:32:43.2808134Z T=128, 2025-05-07T20:32:43.2808322Z D=7168, 2025-05-07T20:32:43.2808511Z scale_ub=None, 2025-05-07T20:32:43.2808726Z contiguous=True, 2025-05-07T20:32:43.2808946Z compiled=False, 2025-05-07T20:32:43.2809156Z ) 2025-05-07T20:32:43.2809474Z self = 2025-05-07T20:32:43.2809966Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2810235Z 2025-05-07T20:32:43.2810312Z @given( 2025-05-07T20:32:43.2810540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2810859Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2811168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2811498Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2811827Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2812115Z ) 2025-05-07T20:32:43.2812471Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2812925Z def test_silu_mul_quant( 2025-05-07T20:32:43.2813161Z self, 2025-05-07T20:32:43.2813353Z T: int, 2025-05-07T20:32:43.2813550Z D: int, 2025-05-07T20:32:43.2813770Z scale_ub: Optional[float], 2025-05-07T20:32:43.2814041Z contiguous: bool, 2025-05-07T20:32:43.2814281Z compiled: bool, 2025-05-07T20:32:43.2814499Z ) -> None: 2025-05-07T20:32:43.2814711Z torch.manual_seed(2025) 2025-05-07T20:32:43.2814955Z 2025-05-07T20:32:43.2815221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2815565Z 2025-05-07T20:32:43.2815764Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2816050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2816361Z x = x_sign * x_clamp 2025-05-07T20:32:43.2816603Z x0 = x[:, :D] 2025-05-07T20:32:43.2816863Z x1 = x[:, D:] 2025-05-07T20:32:43.2817074Z 2025-05-07T20:32:43.2817260Z if contiguous: 2025-05-07T20:32:43.2817489Z x0 = x0.contiguous() 2025-05-07T20:32:43.2817746Z x1 = x1.contiguous() 2025-05-07T20:32:43.2817985Z 2025-05-07T20:32:43.2818172Z if scale_ub is not None: 2025-05-07T20:32:43.2818446Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2818782Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2819086Z ) 2025-05-07T20:32:43.2819277Z else: 2025-05-07T20:32:43.2819488Z scale_ub_tensor = None 2025-05-07T20:32:43.2819742Z 2025-05-07T20:32:43.2819967Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2820282Z op = silu_mul_quant 2025-05-07T20:32:43.2820534Z if compiled: 2025-05-07T20:32:43.2820780Z op = torch.compile(op) 2025-05-07T20:32:43.2821135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2821414Z 2025-05-07T20:32:43.2821603Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2821773Z 2025-05-07T20:32:43.2821872Z moe/activation_test.py:117: 2025-05-07T20:32:43.2822164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2822578Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2822856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2823545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2824231Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2824761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2825441Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2826111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2826696Z kernel = self.compile( 2025-05-07T20:32:43.2827231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2827882Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2828284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2828515Z 2025-05-07T20:32:43.2828730Z self = 2025-05-07T20:32:43.2829804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2831176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8cd1790>} 2025-05-07T20:32:43.2832521Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2833532Z context = 2025-05-07T20:32:43.2833824Z 2025-05-07T20:32:43.2833994Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2834515Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2834977Z module_map=module_map) 2025-05-07T20:32:43.2835338Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2835685Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2835947Z E ^ 2025-05-07T20:32:43.2836452Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2836913Z 2025-05-07T20:32:43.2837326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2837840Z 2025-05-07T20:32:43.2837944Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2838360Z self=, 2025-05-07T20:32:43.2838761Z T=2048, 2025-05-07T20:32:43.2838943Z D=7168, 2025-05-07T20:32:43.2839133Z scale_ub=1200.0, 2025-05-07T20:32:43.2839359Z contiguous=True, 2025-05-07T20:32:43.2839577Z compiled=False, 2025-05-07T20:32:43.2839777Z ) 2025-05-07T20:32:43.2840464Z self = 2025-05-07T20:32:43.2841116Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2841396Z 2025-05-07T20:32:43.2841473Z @given( 2025-05-07T20:32:43.2841699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2842026Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2842332Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2842664Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2842998Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2843386Z ) 2025-05-07T20:32:43.2843793Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2844242Z def test_silu_mul_quant( 2025-05-07T20:32:43.2844481Z self, 2025-05-07T20:32:43.2844673Z T: int, 2025-05-07T20:32:43.2844867Z D: int, 2025-05-07T20:32:43.2845088Z scale_ub: Optional[float], 2025-05-07T20:32:43.2845358Z contiguous: bool, 2025-05-07T20:32:43.2845595Z compiled: bool, 2025-05-07T20:32:43.2845816Z ) -> None: 2025-05-07T20:32:43.2846030Z torch.manual_seed(2025) 2025-05-07T20:32:43.2846272Z 2025-05-07T20:32:43.2846542Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2848639Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2850515Z 2025-05-07T20:32:43.2850637Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2850855Z 2025-05-07T20:32:43.2850956Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2851373Z self=, 2025-05-07T20:32:43.2851768Z T=1, 2025-05-07T20:32:43.2851952Z D=5120, 2025-05-07T20:32:43.2852143Z scale_ub=1200.0, 2025-05-07T20:32:43.2852359Z contiguous=True, 2025-05-07T20:32:43.2852582Z compiled=False, 2025-05-07T20:32:43.2852785Z ) 2025-05-07T20:32:43.2853095Z self = 2025-05-07T20:32:43.2853586Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2853852Z 2025-05-07T20:32:43.2853932Z @given( 2025-05-07T20:32:43.2854152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2854467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2854770Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2855099Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2855423Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2855710Z ) 2025-05-07T20:32:43.2856058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2856568Z def test_silu_mul_quant( 2025-05-07T20:32:43.2856817Z self, 2025-05-07T20:32:43.2857013Z T: int, 2025-05-07T20:32:43.2857206Z D: int, 2025-05-07T20:32:43.2857421Z scale_ub: Optional[float], 2025-05-07T20:32:43.2857696Z contiguous: bool, 2025-05-07T20:32:43.2857934Z compiled: bool, 2025-05-07T20:32:43.2858155Z ) -> None: 2025-05-07T20:32:43.2858367Z torch.manual_seed(2025) 2025-05-07T20:32:43.2858604Z 2025-05-07T20:32:43.2858870Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2859208Z 2025-05-07T20:32:43.2859400Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2859686Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2859995Z x = x_sign * x_clamp 2025-05-07T20:32:43.2860236Z x0 = x[:, :D] 2025-05-07T20:32:43.2860447Z x1 = x[:, D:] 2025-05-07T20:32:43.2860651Z 2025-05-07T20:32:43.2860834Z if contiguous: 2025-05-07T20:32:43.2861135Z x0 = x0.contiguous() 2025-05-07T20:32:43.2861394Z x1 = x1.contiguous() 2025-05-07T20:32:43.2861634Z 2025-05-07T20:32:43.2861820Z if scale_ub is not None: 2025-05-07T20:32:43.2862092Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2862478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2862822Z ) 2025-05-07T20:32:43.2863015Z else: 2025-05-07T20:32:43.2863221Z scale_ub_tensor = None 2025-05-07T20:32:43.2863468Z 2025-05-07T20:32:43.2863700Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2864014Z op = silu_mul_quant 2025-05-07T20:32:43.2864268Z if compiled: 2025-05-07T20:32:43.2864510Z op = torch.compile(op) 2025-05-07T20:32:43.2864805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2865081Z 2025-05-07T20:32:43.2865267Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2865477Z 2025-05-07T20:32:43.2865579Z moe/activation_test.py:117: 2025-05-07T20:32:43.2865876Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2866206Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2866487Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2867180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2867868Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2868401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2869077Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2869737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2870259Z kernel = self.compile( 2025-05-07T20:32:43.2870802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2871454Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2871851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2872084Z 2025-05-07T20:32:43.2872293Z self = 2025-05-07T20:32:43.2873366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2874727Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8c96040>} 2025-05-07T20:32:43.2876104Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2877132Z context = 2025-05-07T20:32:43.2877418Z 2025-05-07T20:32:43.2877588Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2878124Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2878589Z module_map=module_map) 2025-05-07T20:32:43.2878948Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2879298Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2879557Z E ^ 2025-05-07T20:32:43.2880021Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2880474Z 2025-05-07T20:32:43.2880892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2881421Z 2025-05-07T20:32:43.2881525Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2881938Z self=, 2025-05-07T20:32:43.2882331Z T=2048, 2025-05-07T20:32:43.2882594Z D=5120, 2025-05-07T20:32:43.2882826Z scale_ub=None, 2025-05-07T20:32:43.2883035Z contiguous=True, 2025-05-07T20:32:43.2883260Z compiled=False, 2025-05-07T20:32:43.2883461Z ) 2025-05-07T20:32:43.2883777Z self = 2025-05-07T20:32:43.2884269Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2884539Z 2025-05-07T20:32:43.2884616Z @given( 2025-05-07T20:32:43.2884841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2885149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2885458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2885832Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2886158Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2886443Z ) 2025-05-07T20:32:43.2886794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2887241Z def test_silu_mul_quant( 2025-05-07T20:32:43.2887478Z self, 2025-05-07T20:32:43.2887671Z T: int, 2025-05-07T20:32:43.2887867Z D: int, 2025-05-07T20:32:43.2888083Z scale_ub: Optional[float], 2025-05-07T20:32:43.2888352Z contiguous: bool, 2025-05-07T20:32:43.2888590Z compiled: bool, 2025-05-07T20:32:43.2888806Z ) -> None: 2025-05-07T20:32:43.2889018Z torch.manual_seed(2025) 2025-05-07T20:32:43.2889260Z 2025-05-07T20:32:43.2889531Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2889872Z 2025-05-07T20:32:43.2890063Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.2891999Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2893831Z 2025-05-07T20:32:43.2893956Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.2894171Z 2025-05-07T20:32:43.2894272Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2894683Z self=, 2025-05-07T20:32:43.2895083Z T=16384, 2025-05-07T20:32:43.2895269Z D=5120, 2025-05-07T20:32:43.2895507Z scale_ub=None, 2025-05-07T20:32:43.2895717Z contiguous=True, 2025-05-07T20:32:43.2895935Z compiled=False, 2025-05-07T20:32:43.2896139Z ) 2025-05-07T20:32:43.2896454Z self = 2025-05-07T20:32:43.2896948Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2897231Z 2025-05-07T20:32:43.2897307Z @given( 2025-05-07T20:32:43.2897534Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2897845Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2898143Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2898468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2898798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2899078Z ) 2025-05-07T20:32:43.2899428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2899875Z def test_silu_mul_quant( 2025-05-07T20:32:43.2900124Z self, 2025-05-07T20:32:43.2900319Z T: int, 2025-05-07T20:32:43.2900514Z D: int, 2025-05-07T20:32:43.2900727Z scale_ub: Optional[float], 2025-05-07T20:32:43.2900996Z contiguous: bool, 2025-05-07T20:32:43.2901315Z compiled: bool, 2025-05-07T20:32:43.2901579Z ) -> None: 2025-05-07T20:32:43.2901832Z torch.manual_seed(2025) 2025-05-07T20:32:43.2902077Z 2025-05-07T20:32:43.2902348Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2904396Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2906309Z 2025-05-07T20:32:43.2906428Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2906648Z 2025-05-07T20:32:43.2906749Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2907161Z self=, 2025-05-07T20:32:43.2907554Z T=4096, 2025-05-07T20:32:43.2907739Z D=5120, 2025-05-07T20:32:43.2907925Z scale_ub=None, 2025-05-07T20:32:43.2908138Z contiguous=True, 2025-05-07T20:32:43.2908356Z compiled=False, 2025-05-07T20:32:43.2908558Z ) 2025-05-07T20:32:43.2908868Z self = 2025-05-07T20:32:43.2909357Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2909631Z 2025-05-07T20:32:43.2909706Z @given( 2025-05-07T20:32:43.2909933Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2910244Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2910547Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2910901Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2911248Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2911541Z ) 2025-05-07T20:32:43.2911884Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2912321Z def test_silu_mul_quant( 2025-05-07T20:32:43.2912557Z self, 2025-05-07T20:32:43.2912746Z T: int, 2025-05-07T20:32:43.2912939Z D: int, 2025-05-07T20:32:43.2913152Z scale_ub: Optional[float], 2025-05-07T20:32:43.2913419Z contiguous: bool, 2025-05-07T20:32:43.2913653Z compiled: bool, 2025-05-07T20:32:43.2913868Z ) -> None: 2025-05-07T20:32:43.2914080Z torch.manual_seed(2025) 2025-05-07T20:32:43.2914318Z 2025-05-07T20:32:43.2914632Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2916682Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2918546Z 2025-05-07T20:32:43.2918662Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2918876Z 2025-05-07T20:32:43.2918976Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2919387Z self=, 2025-05-07T20:32:43.2919789Z T=2048, 2025-05-07T20:32:43.2925234Z D=5120, 2025-05-07T20:32:43.2925454Z scale_ub=None, 2025-05-07T20:32:43.2925676Z contiguous=False, 2025-05-07T20:32:43.2925904Z compiled=False, 2025-05-07T20:32:43.2926108Z ) 2025-05-07T20:32:43.2926422Z self = 2025-05-07T20:32:43.2927029Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2927312Z 2025-05-07T20:32:43.2927390Z @given( 2025-05-07T20:32:43.2927620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2927933Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2928233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2928564Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2928889Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2929167Z ) 2025-05-07T20:32:43.2929515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2930006Z def test_silu_mul_quant( 2025-05-07T20:32:43.2930246Z self, 2025-05-07T20:32:43.2930432Z T: int, 2025-05-07T20:32:43.2930630Z D: int, 2025-05-07T20:32:43.2930847Z scale_ub: Optional[float], 2025-05-07T20:32:43.2931120Z contiguous: bool, 2025-05-07T20:32:43.2931367Z compiled: bool, 2025-05-07T20:32:43.2931591Z ) -> None: 2025-05-07T20:32:43.2931802Z torch.manual_seed(2025) 2025-05-07T20:32:43.2932041Z 2025-05-07T20:32:43.2932311Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2934366Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2936225Z 2025-05-07T20:32:43.2936347Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2936566Z 2025-05-07T20:32:43.2936671Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2937090Z self=, 2025-05-07T20:32:43.2937488Z T=4096, 2025-05-07T20:32:43.2937669Z D=7168, 2025-05-07T20:32:43.2937854Z scale_ub=None, 2025-05-07T20:32:43.2938063Z contiguous=True, 2025-05-07T20:32:43.2938280Z compiled=True, 2025-05-07T20:32:43.2938481Z ) 2025-05-07T20:32:43.2938798Z self = 2025-05-07T20:32:43.2939287Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2939560Z 2025-05-07T20:32:43.2939685Z @given( 2025-05-07T20:32:43.2939914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2940500Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2940805Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2941185Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2941529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2941813Z ) 2025-05-07T20:32:43.2942163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2942603Z def test_silu_mul_quant( 2025-05-07T20:32:43.2942839Z self, 2025-05-07T20:32:43.2943028Z T: int, 2025-05-07T20:32:43.2943223Z D: int, 2025-05-07T20:32:43.2943436Z scale_ub: Optional[float], 2025-05-07T20:32:43.2943706Z contiguous: bool, 2025-05-07T20:32:43.2943943Z compiled: bool, 2025-05-07T20:32:43.2944161Z ) -> None: 2025-05-07T20:32:43.2944380Z torch.manual_seed(2025) 2025-05-07T20:32:43.2944626Z 2025-05-07T20:32:43.2944894Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2947017Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
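Note that the hint at the end of each message is conditional: expandable_segments is suggested only "if reserved but unallocated memory is large". Here it is not (19.12 MiB reserved-but-unallocated against 21.73 GiB allocated), so fragmentation is not what is killing these examples; the pool is simply full. If one did want to try the knob anyway, it must be in the environment before the first CUDA allocation in the process:

    import os

    # Must happen before torch initializes the CUDA caching allocator.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # deliberately imported after the env var is set
    x = torch.randn(8, device="cuda")  # allocator now uses expandable segments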
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2948932Z 2025-05-07T20:32:43.2949055Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2949270Z 2025-05-07T20:32:43.2949372Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2949788Z self=, 2025-05-07T20:32:43.2950239Z T=2048, 2025-05-07T20:32:43.2950425Z D=5120, 2025-05-07T20:32:43.2950615Z scale_ub=1200.0, 2025-05-07T20:32:43.2950836Z contiguous=False, 2025-05-07T20:32:43.2951056Z compiled=False, 2025-05-07T20:32:43.2951258Z ) 2025-05-07T20:32:43.2951576Z self = 2025-05-07T20:32:43.2952068Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2952343Z 2025-05-07T20:32:43.2952418Z @given( 2025-05-07T20:32:43.2952645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2952952Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2953254Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2953584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2953908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2954195Z ) 2025-05-07T20:32:43.2954544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2954988Z def test_silu_mul_quant( 2025-05-07T20:32:43.2955225Z self, 2025-05-07T20:32:43.2955419Z T: int, 2025-05-07T20:32:43.2955614Z D: int, 2025-05-07T20:32:43.2955829Z scale_ub: Optional[float], 2025-05-07T20:32:43.2956108Z contiguous: bool, 2025-05-07T20:32:43.2956345Z compiled: bool, 2025-05-07T20:32:43.2956560Z ) -> None: 2025-05-07T20:32:43.2956773Z torch.manual_seed(2025) 2025-05-07T20:32:43.2957014Z 2025-05-07T20:32:43.2957281Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2959353Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2961210Z 2025-05-07T20:32:43.2961333Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2961554Z 2025-05-07T20:32:43.2961656Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2962068Z self=, 2025-05-07T20:32:43.2962466Z T=4096, 2025-05-07T20:32:43.2962651Z D=7168, 2025-05-07T20:32:43.2962842Z scale_ub=1200.0, 2025-05-07T20:32:43.2963058Z contiguous=True, 2025-05-07T20:32:43.2963276Z compiled=False, 2025-05-07T20:32:43.2963480Z ) 2025-05-07T20:32:43.2963789Z self = 2025-05-07T20:32:43.2964287Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2964569Z 2025-05-07T20:32:43.2964647Z @given( 2025-05-07T20:32:43.2964868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2965180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2965486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2965683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2965805Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2965879Z ) 2025-05-07T20:32:43.2966123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2966222Z def test_silu_mul_quant( 2025-05-07T20:32:43.2966300Z self, 2025-05-07T20:32:43.2966380Z T: int, 2025-05-07T20:32:43.2966456Z D: int, 2025-05-07T20:32:43.2966553Z scale_ub: Optional[float], 2025-05-07T20:32:43.2966645Z contiguous: bool, 2025-05-07T20:32:43.2966732Z compiled: bool, 2025-05-07T20:32:43.2966856Z ) -> None: 2025-05-07T20:32:43.2966949Z torch.manual_seed(2025) 2025-05-07T20:32:43.2967022Z 2025-05-07T20:32:43.2967193Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2968941Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
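The compiled flag only changes how the op is invoked, not which kernel ultimately runs: further down the log, compiled=False examples fail inside _fbgemm_silu_mul_quant directly, while compiled=True examples reach the same Triton compile via torch/_dynamo/eval_frame.py. A minimal sketch of the toggle the test applies (op stands in for the imported FBGEMM silu_mul_quant):

    import torch

    def run_op(op, *args, compiled: bool = False):
        # torch.compile wraps the callable; the underlying Triton kernel
        # is still compiled for the current GPU either way.
        fn = torch.compile(op) if compiled else op
        return fn(*args)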
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2968950Z 2025-05-07T20:32:43.2969070Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2969074Z 2025-05-07T20:32:43.2969180Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2969403Z self=, 2025-05-07T20:32:43.2969487Z T=16384, 2025-05-07T20:32:43.2969567Z D=7168, 2025-05-07T20:32:43.2969648Z scale_ub=None, 2025-05-07T20:32:43.2969739Z contiguous=False, 2025-05-07T20:32:43.2969825Z compiled=True, 2025-05-07T20:32:43.2969904Z ) 2025-05-07T20:32:43.2970120Z self = 2025-05-07T20:32:43.2970295Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2970299Z 2025-05-07T20:32:43.2970382Z @given( 2025-05-07T20:32:43.2970500Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2970599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2970717Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2970832Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2970987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2971069Z ) 2025-05-07T20:32:43.2971312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2971408Z def test_silu_mul_quant( 2025-05-07T20:32:43.2971485Z self, 2025-05-07T20:32:43.2971563Z T: int, 2025-05-07T20:32:43.2971651Z D: int, 2025-05-07T20:32:43.2971749Z scale_ub: Optional[float], 2025-05-07T20:32:43.2971837Z contiguous: bool, 2025-05-07T20:32:43.2971929Z compiled: bool, 2025-05-07T20:32:43.2972009Z ) -> None: 2025-05-07T20:32:43.2972104Z torch.manual_seed(2025) 2025-05-07T20:32:43.2972181Z 2025-05-07T20:32:43.2972347Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2974142Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
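For reference, the computation these examples try to exercise (its full source appears in the longer listings below) is SiLU(x0) * x1 followed by rowwise FP8 quantization. A minimal eager-mode sketch, assuming a per-row absmax scale and torch.float8_e4m3fn (PyTorch >= 2.1); the test's actual reference delegates to triton_quantize_fp8_row, whose exact clamping details may differ:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        fp8: torch.dtype = torch.float8_e4m3fn,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1
        row_max = y.abs().amax(dim=1)                # one scale per row
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / torch.finfo(fp8).max
        return (y / scale[:, None]).to(fp8), scale   # dequant: y_fp8 * scale[:, None]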
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2974187Z 2025-05-07T20:32:43.2974306Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2974311Z 2025-05-07T20:32:43.2974416Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2974636Z self=, 2025-05-07T20:32:43.2974713Z T=4096, 2025-05-07T20:32:43.2974790Z D=7168, 2025-05-07T20:32:43.2974871Z scale_ub=None, 2025-05-07T20:32:43.2974954Z contiguous=True, 2025-05-07T20:32:43.2975040Z compiled=False, 2025-05-07T20:32:43.2975115Z ) 2025-05-07T20:32:43.2975333Z self = 2025-05-07T20:32:43.2975578Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2975582Z 2025-05-07T20:32:43.2975660Z @given( 2025-05-07T20:32:43.2975780Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2975878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2975998Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2976117Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2976229Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2976302Z ) 2025-05-07T20:32:43.2976554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2976648Z def test_silu_mul_quant( 2025-05-07T20:32:43.2976724Z self, 2025-05-07T20:32:43.2976803Z T: int, 2025-05-07T20:32:43.2976878Z D: int, 2025-05-07T20:32:43.2976974Z scale_ub: Optional[float], 2025-05-07T20:32:43.2977072Z contiguous: bool, 2025-05-07T20:32:43.2977158Z compiled: bool, 2025-05-07T20:32:43.2977238Z ) -> None: 2025-05-07T20:32:43.2977332Z torch.manual_seed(2025) 2025-05-07T20:32:43.2977404Z 2025-05-07T20:32:43.2977576Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2979358Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2979365Z 2025-05-07T20:32:43.2979485Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2979537Z 2025-05-07T20:32:43.2979642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2979862Z self=, 2025-05-07T20:32:43.2979945Z T=16384, 2025-05-07T20:32:43.2980024Z D=7168, 2025-05-07T20:32:43.2980108Z scale_ub=None, 2025-05-07T20:32:43.2980199Z contiguous=True, 2025-05-07T20:32:43.2980283Z compiled=False, 2025-05-07T20:32:43.2980358Z ) 2025-05-07T20:32:43.2980577Z self = 2025-05-07T20:32:43.2980768Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2980774Z 2025-05-07T20:32:43.2980862Z @given( 2025-05-07T20:32:43.2981003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2981147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2981264Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2981383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2981505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2981581Z ) 2025-05-07T20:32:43.2981831Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2981928Z def test_silu_mul_quant( 2025-05-07T20:32:43.2982081Z self, 2025-05-07T20:32:43.2982160Z T: int, 2025-05-07T20:32:43.2982239Z D: int, 2025-05-07T20:32:43.2982336Z scale_ub: Optional[float], 2025-05-07T20:32:43.2982425Z contiguous: bool, 2025-05-07T20:32:43.2982514Z compiled: bool, 2025-05-07T20:32:43.2982591Z ) -> None: 2025-05-07T20:32:43.2982686Z torch.manual_seed(2025) 2025-05-07T20:32:43.2982762Z 2025-05-07T20:32:43.2982930Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2984685Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2984732Z 2025-05-07T20:32:43.2984849Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2984854Z 2025-05-07T20:32:43.2984963Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2985184Z self=, 2025-05-07T20:32:43.2985261Z T=16384, 2025-05-07T20:32:43.2985343Z D=7168, 2025-05-07T20:32:43.2985426Z scale_ub=1200.0, 2025-05-07T20:32:43.2985509Z contiguous=True, 2025-05-07T20:32:43.2985594Z compiled=False, 2025-05-07T20:32:43.2985668Z ) 2025-05-07T20:32:43.2985884Z self = 2025-05-07T20:32:43.2986067Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2986072Z 2025-05-07T20:32:43.2986149Z @given( 2025-05-07T20:32:43.2986271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2986374Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2986489Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2986608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2986723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2986795Z ) 2025-05-07T20:32:43.2987046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2987138Z def test_silu_mul_quant( 2025-05-07T20:32:43.2987214Z self, 2025-05-07T20:32:43.2987294Z T: int, 2025-05-07T20:32:43.2987371Z D: int, 2025-05-07T20:32:43.2987512Z scale_ub: Optional[float], 2025-05-07T20:32:43.2987609Z contiguous: bool, 2025-05-07T20:32:43.2987694Z compiled: bool, 2025-05-07T20:32:43.2987774Z ) -> None: 2025-05-07T20:32:43.2987869Z torch.manual_seed(2025) 2025-05-07T20:32:43.2987941Z 2025-05-07T20:32:43.2988116Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2989897Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
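One detail worth noticing across these blocks: the "allocated by PyTorch" figure creeps upward between examples (21.73 GiB here, 21.74 GiB and then 21.77 GiB further down), i.e. tensors from earlier Hypothesis draws are still alive when the next draw starts. A blunt between-example cleanup, sketched below, is one way to rule that out; this is an assumption about a fix, not something the suite currently does:

    import gc
    import torch

    def free_cuda() -> None:
        gc.collect()               # drop dead Python references first
        torch.cuda.empty_cache()   # then return cached blocks to the driver

    # e.g. call free_cuda() at the top of test_silu_mul_quant.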
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2989906Z 2025-05-07T20:32:43.2990029Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2990034Z 2025-05-07T20:32:43.2990137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2990358Z self=, 2025-05-07T20:32:43.2990441Z T=128, 2025-05-07T20:32:43.2990590Z D=5120, 2025-05-07T20:32:43.2990673Z scale_ub=1200.0, 2025-05-07T20:32:43.2990761Z contiguous=False, 2025-05-07T20:32:43.2990844Z compiled=False, 2025-05-07T20:32:43.2990919Z ) 2025-05-07T20:32:43.2991133Z self = 2025-05-07T20:32:43.2991308Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2991313Z 2025-05-07T20:32:43.2991395Z @given( 2025-05-07T20:32:43.2991513Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2991613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2991773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2991896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2992009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2992085Z ) 2025-05-07T20:32:43.2992333Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2992435Z def test_silu_mul_quant( 2025-05-07T20:32:43.2992515Z self, 2025-05-07T20:32:43.2992592Z T: int, 2025-05-07T20:32:43.2992669Z D: int, 2025-05-07T20:32:43.2992769Z scale_ub: Optional[float], 2025-05-07T20:32:43.2992858Z contiguous: bool, 2025-05-07T20:32:43.2992945Z compiled: bool, 2025-05-07T20:32:43.2993024Z ) -> None: 2025-05-07T20:32:43.2993120Z torch.manual_seed(2025) 2025-05-07T20:32:43.2993196Z 2025-05-07T20:32:43.2993364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2993438Z 2025-05-07T20:32:43.2993540Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2993665Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2993754Z x = x_sign * x_clamp 2025-05-07T20:32:43.2993838Z x0 = x[:, :D] 2025-05-07T20:32:43.2993918Z x1 = x[:, D:] 2025-05-07T20:32:43.2993989Z 2025-05-07T20:32:43.2994079Z if contiguous: 2025-05-07T20:32:43.2994173Z x0 = x0.contiguous() 2025-05-07T20:32:43.2994267Z x1 = x1.contiguous() 2025-05-07T20:32:43.2994342Z 2025-05-07T20:32:43.2994431Z if scale_ub is not None: 2025-05-07T20:32:43.2994540Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2994677Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2994754Z ) 2025-05-07T20:32:43.2994835Z else: 2025-05-07T20:32:43.2994929Z scale_ub_tensor = None 2025-05-07T20:32:43.2995003Z 2025-05-07T20:32:43.2995136Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2995272Z op = silu_mul_quant 2025-05-07T20:32:43.2995359Z if compiled: 2025-05-07T20:32:43.2995463Z op = torch.compile(op) 2025-05-07T20:32:43.2995567Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2995638Z 2025-05-07T20:32:43.2995731Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2995740Z 2025-05-07T20:32:43.2995837Z moe/activation_test.py:117: 2025-05-07T20:32:43.2995966Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2996066Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2996166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2996674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2996771Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2997127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2997361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2997705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2997804Z kernel = self.compile( 2025-05-07T20:32:43.2998261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2998440Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2998571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2998575Z 2025-05-07T20:32:43.2998782Z self = 2025-05-07T20:32:43.2999557Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.3000110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd8a46ca0>} 2025-05-07T20:32:43.3000852Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.3001049Z context = 2025-05-07T20:32:43.3001053Z 2025-05-07T20:32:43.3001218Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.3001484Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.3001591Z module_map=module_map) 2025-05-07T20:32:43.3001752Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.3001858Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.3001935Z E ^ 2025-05-07T20:32:43.3002291Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.3002296Z 2025-05-07T20:32:43.3002708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.3002715Z 2025-05-07T20:32:43.3002818Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3003045Z self=, 2025-05-07T20:32:43.3003122Z T=2048, 2025-05-07T20:32:43.3003202Z D=7168, 2025-05-07T20:32:43.3003287Z scale_ub=None, 2025-05-07T20:32:43.3003375Z contiguous=False, 2025-05-07T20:32:43.3003462Z compiled=False, 2025-05-07T20:32:43.3003535Z ) 2025-05-07T20:32:43.3003749Z self = 2025-05-07T20:32:43.3003969Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.3003975Z 2025-05-07T20:32:43.3004053Z @given( 2025-05-07T20:32:43.3004170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3004271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3004390Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3004507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3004622Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3004695Z ) 2025-05-07T20:32:43.3004943Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3005035Z def test_silu_mul_quant( 2025-05-07T20:32:43.3005110Z self, 2025-05-07T20:32:43.3005188Z T: int, 2025-05-07T20:32:43.3005263Z D: int, 2025-05-07T20:32:43.3005359Z scale_ub: Optional[float], 2025-05-07T20:32:43.3005449Z contiguous: bool, 2025-05-07T20:32:43.3005542Z compiled: bool, 2025-05-07T20:32:43.3005619Z ) -> None: 2025-05-07T20:32:43.3005717Z torch.manual_seed(2025) 2025-05-07T20:32:43.3005789Z 2025-05-07T20:32:43.3005956Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3007755Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
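This is the run's other failure mode: not memory, but architecture. Triton refuses to lower fp8e4nv (the e4m3 format) and reports only fp8e4b15 and fp8e5 as available, which indicates a GPU without native FP8 e4m3 support. A quick probe; the sm_89 threshold is an assumption about where native e4m3 begins (Ada and newer):

    import torch

    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability: sm_{major}{minor}")
    # Assumption: fp8e4nv lowers natively only on sm_89+; on older parts
    # Triton raises exactly the ValueError seen above.
    print("fp8e4nv expected to work:", (major, minor) >= (8, 9))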
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.3007824Z 2025-05-07T20:32:43.3007944Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.3007952Z 2025-05-07T20:32:43.3008098Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3008318Z self=, 2025-05-07T20:32:43.3008398Z T=128, 2025-05-07T20:32:43.3008473Z D=7168, 2025-05-07T20:32:43.3008554Z scale_ub=1200.0, 2025-05-07T20:32:43.3008640Z contiguous=True, 2025-05-07T20:32:43.3008729Z compiled=True, 2025-05-07T20:32:43.3008801Z ) 2025-05-07T20:32:43.3009021Z self = 2025-05-07T20:32:43.3009189Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.3009194Z 2025-05-07T20:32:43.3009269Z @given( 2025-05-07T20:32:43.3009391Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3009489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3009605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3009720Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3009841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3009921Z ) 2025-05-07T20:32:43.3010170Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3010262Z def test_silu_mul_quant( 2025-05-07T20:32:43.3010339Z self, 2025-05-07T20:32:43.3010420Z T: int, 2025-05-07T20:32:43.3010497Z D: int, 2025-05-07T20:32:43.3010601Z scale_ub: Optional[float], 2025-05-07T20:32:43.3010689Z contiguous: bool, 2025-05-07T20:32:43.3010789Z compiled: bool, 2025-05-07T20:32:43.3010882Z ) -> None: 2025-05-07T20:32:43.3010990Z torch.manual_seed(2025) 2025-05-07T20:32:43.3011077Z 2025-05-07T20:32:43.3011242Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3011316Z 2025-05-07T20:32:43.3011409Z x_sign = torch.sign(x) 2025-05-07T20:32:43.3011532Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.3011686Z x = x_sign * x_clamp 2025-05-07T20:32:43.3011771Z x0 = x[:, :D] 2025-05-07T20:32:43.3011850Z x1 = x[:, D:] 2025-05-07T20:32:43.3011921Z 2025-05-07T20:32:43.3012008Z if contiguous: 2025-05-07T20:32:43.3012100Z x0 = x0.contiguous() 2025-05-07T20:32:43.3012190Z x1 = x1.contiguous() 2025-05-07T20:32:43.3012272Z 2025-05-07T20:32:43.3012362Z if scale_ub is not None: 2025-05-07T20:32:43.3012469Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.3012605Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.3012682Z ) 2025-05-07T20:32:43.3012761Z else: 2025-05-07T20:32:43.3012858Z scale_ub_tensor = None 2025-05-07T20:32:43.3012930Z 2025-05-07T20:32:43.3013065Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.3013154Z op = silu_mul_quant 2025-05-07T20:32:43.3013239Z if compiled: 2025-05-07T20:32:43.3013349Z op = torch.compile(op) 2025-05-07T20:32:43.3013454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.3013525Z 2025-05-07T20:32:43.3013617Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.3013622Z 2025-05-07T20:32:43.3013719Z moe/activation_test.py:117: 2025-05-07T20:32:43.3013891Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.3014029Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.3014129Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.3014501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.3014593Z return fn(*args, **kwargs) 2025-05-07T20:32:43.3015083Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.3015185Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.3015543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.3015812Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.3016146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.3016246Z kernel = self.compile( 2025-05-07T20:32:43.3016633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.3016809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.3016938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.3016943Z 2025-05-07T20:32:43.3017151Z self = 2025-05-07T20:32:43.3017921Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.3018437Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fbfd89350d0>} 2025-05-07T20:32:43.3019191Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.3019387Z context = 2025-05-07T20:32:43.3019392Z 2025-05-07T20:32:43.3019559Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.3019821Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.3019930Z module_map=module_map) 2025-05-07T20:32:43.3020136Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.3020242Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.3020318Z E ^ 2025-05-07T20:32:43.3020669Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.3020677Z 2025-05-07T20:32:43.3021152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.3021157Z 2025-05-07T20:32:43.3021258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3021482Z self=, 2025-05-07T20:32:43.3021558Z T=128, 2025-05-07T20:32:43.3021633Z D=7168, 2025-05-07T20:32:43.3021723Z scale_ub=1200.0, 2025-05-07T20:32:43.3021807Z contiguous=True, 2025-05-07T20:32:43.3021892Z compiled=False, 2025-05-07T20:32:43.3021967Z ) 2025-05-07T20:32:43.3022186Z self = 2025-05-07T20:32:43.3022361Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.3022365Z 2025-05-07T20:32:43.3022446Z @given( 2025-05-07T20:32:43.3022562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3022737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3022856Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3022971Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3023090Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3023164Z ) 2025-05-07T20:32:43.3023410Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3023505Z def test_silu_mul_quant( 2025-05-07T20:32:43.3023581Z self, 2025-05-07T20:32:43.3023656Z T: int, 2025-05-07T20:32:43.3023735Z D: int, 2025-05-07T20:32:43.3023833Z scale_ub: Optional[float], 2025-05-07T20:32:43.3023966Z contiguous: bool, 2025-05-07T20:32:43.3024053Z compiled: bool, 2025-05-07T20:32:43.3024130Z ) -> None: 2025-05-07T20:32:43.3024230Z torch.manual_seed(2025) 2025-05-07T20:32:43.3024302Z 2025-05-07T20:32:43.3024469Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3024551Z 2025-05-07T20:32:43.3024642Z x_sign = torch.sign(x) 2025-05-07T20:32:43.3024769Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.3026520Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
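In this block and the next, the OOM has moved from line 92 (the randn) to line 95 (the clamp): the input now fits, but torch.sign, torch.abs, and torch.clamp each materialize another [T, 2*D] temporary. If the goal were only to trim peak memory in this preprocessing, in-place variants would do the same math with one temporary instead of three; a sketch, not the test's code:

    import torch

    def clamp_magnitude_(x: torch.Tensor) -> torch.Tensor:
        # Equivalent to sign(x) * clamp(abs(x), 0.01, 2.0) with fewer temporaries.
        x_sign = torch.sign(x)            # the one temporary we keep
        return x.abs_().clamp_(0.01, 2.0).mul_(x_sign)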
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.3026528Z 2025-05-07T20:32:43.3026645Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.3026650Z 2025-05-07T20:32:43.3026756Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3026983Z self=, 2025-05-07T20:32:43.3027061Z T=128, 2025-05-07T20:32:43.3027140Z D=5120, 2025-05-07T20:32:43.3027221Z scale_ub=1200.0, 2025-05-07T20:32:43.3027304Z contiguous=True, 2025-05-07T20:32:43.3027390Z compiled=True, 2025-05-07T20:32:43.3027461Z ) 2025-05-07T20:32:43.3027677Z self = 2025-05-07T20:32:43.3027846Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.3027851Z 2025-05-07T20:32:43.3027928Z @given( 2025-05-07T20:32:43.3028047Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3028189Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3028306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3028423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3028536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3028616Z ) 2025-05-07T20:32:43.3028862Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3028956Z def test_silu_mul_quant( 2025-05-07T20:32:43.3029034Z self, 2025-05-07T20:32:43.3029111Z T: int, 2025-05-07T20:32:43.3029186Z D: int, 2025-05-07T20:32:43.3029286Z scale_ub: Optional[float], 2025-05-07T20:32:43.3029375Z contiguous: bool, 2025-05-07T20:32:43.3029459Z compiled: bool, 2025-05-07T20:32:43.3029539Z ) -> None: 2025-05-07T20:32:43.3029632Z torch.manual_seed(2025) 2025-05-07T20:32:43.3029703Z 2025-05-07T20:32:43.3029875Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3029953Z 2025-05-07T20:32:43.3030044Z x_sign = torch.sign(x) 2025-05-07T20:32:43.3030171Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.3031952Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.3031996Z 2025-05-07T20:32:43.3032117Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.3032121Z 2025-05-07T20:32:43.3032224Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3032493Z self=, 2025-05-07T20:32:43.3032571Z T=128, 2025-05-07T20:32:43.3032646Z D=7168, 2025-05-07T20:32:43.3032729Z scale_ub=None, 2025-05-07T20:32:43.3032813Z contiguous=True, 2025-05-07T20:32:43.3032894Z compiled=True, 2025-05-07T20:32:43.3032972Z ) 2025-05-07T20:32:43.3033191Z self = 2025-05-07T20:32:43.3033360Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.3033365Z 2025-05-07T20:32:43.3033440Z @given( 2025-05-07T20:32:43.3033555Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3033658Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3033770Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3033885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3034001Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3034079Z ) 2025-05-07T20:32:43.3034327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3034421Z def test_silu_mul_quant( 2025-05-07T20:32:43.3034498Z self, 2025-05-07T20:32:43.3034576Z T: int, 2025-05-07T20:32:43.3034657Z D: int, 2025-05-07T20:32:43.3034757Z scale_ub: Optional[float], 2025-05-07T20:32:43.3034850Z contiguous: bool, 2025-05-07T20:32:43.3034934Z compiled: bool, 2025-05-07T20:32:43.3035012Z ) -> None: 2025-05-07T20:32:43.3035111Z torch.manual_seed(2025) 2025-05-07T20:32:43.3035185Z 2025-05-07T20:32:43.3035353Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3037137Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.3037151Z 2025-05-07T20:32:43.3037270Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.3037409Z =============================== warnings summary =============================== 2025-05-07T20:32:43.3037712Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:43.3038017Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:43.3038311Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:43.3039177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:43.3039412Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:43.3039513Z 2025-05-07T20:32:43.3039727Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:43.3039896Z ================= 1 failed, 1 deselected, 3 warnings in 19.45s ================= 2025-05-07T20:32:44.8487177Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:44.9116742Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:44.9117078Z 2025-05-07T20:32:46.9135965Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:49.0826173Z ============================= test session starts ============================== 2025-05-07T20:32:49.0826895Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:49.0827441Z cachedir: .pytest_cache 2025-05-07T20:32:49.0828044Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:49.0828788Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:49.0829224Z plugins: hypothesis-6.131.14 2025-05-07T20:32:50.6956144Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:50.9071847Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:50.9072943Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:50.9073267Z 2025-05-07T20:32:53.6352094Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6353055Z self=, 2025-05-07T20:32:53.6353613Z T=1, 2025-05-07T20:32:53.6353883Z D=5120, 2025-05-07T20:32:53.6354145Z scale_ub=None, 2025-05-07T20:32:53.6354384Z contiguous=True, 2025-05-07T20:32:53.6361994Z compiled=True, 2025-05-07T20:32:53.6362263Z ) 2025-05-07T20:32:53.6362641Z self = 2025-05-07T20:32:53.6363207Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.6363473Z 2025-05-07T20:32:53.6363558Z @given( 2025-05-07T20:32:53.6363807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.6364139Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.6364753Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.6365115Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.6365463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.6365768Z ) 2025-05-07T20:32:53.6366130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.6366607Z def test_silu_mul_quant( 2025-05-07T20:32:53.6366869Z self, 2025-05-07T20:32:53.6367074Z T: int, 2025-05-07T20:32:53.6367291Z D: int, 2025-05-07T20:32:53.6367526Z scale_ub: Optional[float], 2025-05-07T20:32:53.6367806Z contiguous: bool, 2025-05-07T20:32:53.6368057Z compiled: bool, 2025-05-07T20:32:53.6368307Z ) -> None: 2025-05-07T20:32:53.6368532Z torch.manual_seed(2025) 2025-05-07T20:32:53.6368790Z 2025-05-07T20:32:53.6369079Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.6369430Z 2025-05-07T20:32:53.6369640Z x_sign = torch.sign(x) 2025-05-07T20:32:53.6369957Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:53.6370276Z x = x_sign * x_clamp 2025-05-07T20:32:53.6370532Z x0 = x[:, :D] 2025-05-07T20:32:53.6370764Z x1 = x[:, D:] 2025-05-07T20:32:53.6370980Z 2025-05-07T20:32:53.6371177Z if contiguous: 2025-05-07T20:32:53.6371610Z x0 = x0.contiguous() 2025-05-07T20:32:53.6371886Z x1 = x1.contiguous() 2025-05-07T20:32:53.6372136Z 2025-05-07T20:32:53.6372342Z if scale_ub is not None: 2025-05-07T20:32:53.6372629Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.6372975Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.6373296Z ) 2025-05-07T20:32:53.6373505Z else: 2025-05-07T20:32:53.6373724Z scale_ub_tensor = None 2025-05-07T20:32:53.6373991Z 2025-05-07T20:32:53.6374240Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.6374565Z op = silu_mul_quant 2025-05-07T20:32:53.6374918Z if compiled: 2025-05-07T20:32:53.6375180Z op = torch.compile(op) 2025-05-07T20:32:53.6375481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.6375771Z 2025-05-07T20:32:53.6375978Z y_fp8, y_scale = fn() 2025-05-07T20:32:53.6376277Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:53.6376583Z 2025-05-07T20:32:53.6376835Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.6377192Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:53.6377498Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:53.6377822Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:53.6378199Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.6378515Z 2025-05-07T20:32:53.6378733Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:53.6378932Z 2025-05-07T20:32:53.6379046Z moe/activation_test.py:126: 2025-05-07T20:32:53.6379354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.6379708Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:53.6380052Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.6380862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:53.6381717Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:53.6382279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.6382974Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.6383672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:53.6384457Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.6385224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:53.6385983Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.6386730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:53.6387387Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:53.6388002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:53.6388542Z fn() 2025-05-07T20:32:53.6389059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:53.6389653Z self.fn.run( 
2025-05-07T20:32:53.6390126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.6390667Z kernel = self.compile( 2025-05-07T20:32:53.6391227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.6391887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.6392381Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.6392616Z 2025-05-07T20:32:53.6392825Z self = 2025-05-07T20:32:53.6393908Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.6395294Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15925d89d0>} 2025-05-07T20:32:53.6396692Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.6397711Z context = 2025-05-07T20:32:53.6398008Z 2025-05-07T20:32:53.6398179Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.6398722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.6399200Z module_map=module_map) 2025-05-07T20:32:53.6399570Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.6399944Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:53.6400229Z E ^ 2025-05-07T20:32:53.6400702Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.6401163Z 2025-05-07T20:32:53.6401581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.6402102Z 2025-05-07T20:32:53.6402214Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6402641Z self=, 2025-05-07T20:32:53.6403059Z T=2048, 2025-05-07T20:32:53.6403255Z D=5120, 2025-05-07T20:32:53.6403462Z scale_ub=1200.0, 2025-05-07T20:32:53.6403700Z contiguous=True, 2025-05-07T20:32:53.6403930Z compiled=False, 2025-05-07T20:32:53.6404156Z ) 2025-05-07T20:32:55.1138082Z self = 2025-05-07T20:32:55.1138913Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:55.1139321Z 2025-05-07T20:32:55.1139443Z @given( 2025-05-07T20:32:55.1139794Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1140880Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1141296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1141646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1141986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1142290Z ) 2025-05-07T20:32:55.1142665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1143123Z def test_silu_mul_quant( 2025-05-07T20:32:55.1143381Z self, 2025-05-07T20:32:55.1143592Z T: int, 2025-05-07T20:32:55.1143798Z D: int, 2025-05-07T20:32:55.1144035Z scale_ub: Optional[float], 2025-05-07T20:32:55.1144321Z contiguous: bool, 2025-05-07T20:32:55.1144575Z compiled: bool, 2025-05-07T20:32:55.1144814Z ) -> None: 2025-05-07T20:32:55.1145048Z torch.manual_seed(2025) 2025-05-07T20:32:55.1145303Z 2025-05-07T20:32:55.1145588Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1145954Z 
2025-05-07T20:32:55.1146164Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1146464Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1146794Z x = x_sign * x_clamp 2025-05-07T20:32:55.1147054Z x0 = x[:, :D] 2025-05-07T20:32:55.1147374Z x1 = x[:, D:] 2025-05-07T20:32:55.1147705Z 2025-05-07T20:32:55.1147909Z if contiguous: 2025-05-07T20:32:55.1148148Z x0 = x0.contiguous() 2025-05-07T20:32:55.1148425Z x1 = x1.contiguous() 2025-05-07T20:32:55.1148681Z 2025-05-07T20:32:55.1148878Z if scale_ub is not None: 2025-05-07T20:32:55.1149167Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.1149517Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.1149839Z ) 2025-05-07T20:32:55.1150047Z else: 2025-05-07T20:32:55.1150272Z scale_ub_tensor = None 2025-05-07T20:32:55.1150539Z 2025-05-07T20:32:55.1150875Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.1151207Z op = silu_mul_quant 2025-05-07T20:32:55.1151470Z if compiled: 2025-05-07T20:32:55.1151726Z op = torch.compile(op) 2025-05-07T20:32:55.1152035Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1152330Z 2025-05-07T20:32:55.1152531Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.1152709Z 2025-05-07T20:32:55.1152816Z moe/activation_test.py:117: 2025-05-07T20:32:55.1153129Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1153469Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.1153772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1154514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.1155241Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.1155792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.1156486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.1157164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.1157706Z kernel = self.compile( 2025-05-07T20:32:55.1158268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.1158937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.1159351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1159588Z 2025-05-07T20:32:55.1159801Z self = 2025-05-07T20:32:55.1160958Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.1162352Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f156fced5e0>} 2025-05-07T20:32:55.1163704Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.1164779Z context = 2025-05-07T20:32:55.1165073Z 2025-05-07T20:32:55.1165245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.1165788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.1166266Z module_map=module_map) 2025-05-07T20:32:55.1166648Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.1167024Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.1167298Z E ^ 2025-05-07T20:32:55.1167773Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.1168324Z 2025-05-07T20:32:55.1168745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.1169265Z 2025-05-07T20:32:55.1169373Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1169800Z self=, 2025-05-07T20:32:55.1170212Z T=2048, 2025-05-07T20:32:55.1170408Z D=5120, 2025-05-07T20:32:55.1170613Z scale_ub=1200.0, 2025-05-07T20:32:55.1170846Z contiguous=True, 2025-05-07T20:32:55.1171077Z compiled=True, 2025-05-07T20:32:55.1171298Z ) 2025-05-07T20:32:55.1171629Z self = 2025-05-07T20:32:55.1172183Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.1172486Z 2025-05-07T20:32:55.1172570Z @given( 2025-05-07T20:32:55.1172812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1173141Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1173462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1173807Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1174142Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1174441Z ) 2025-05-07T20:32:55.1174804Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1175254Z def test_silu_mul_quant( 2025-05-07T20:32:55.1175509Z self, 2025-05-07T20:32:55.1175716Z T: int, 2025-05-07T20:32:55.1175920Z D: int, 2025-05-07T20:32:55.1176153Z scale_ub: Optional[float], 2025-05-07T20:32:55.1176446Z contiguous: bool, 2025-05-07T20:32:55.1176692Z compiled: bool, 2025-05-07T20:32:55.1176930Z ) -> None: 2025-05-07T20:32:55.1177164Z torch.manual_seed(2025) 2025-05-07T20:32:55.1177420Z 2025-05-07T20:32:55.1177693Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1178052Z 2025-05-07T20:32:55.1178255Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1178550Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1178874Z x = x_sign * x_clamp 2025-05-07T20:32:55.1179129Z x0 = x[:, :D] 2025-05-07T20:32:55.1179350Z x1 = x[:, D:] 2025-05-07T20:32:55.1179567Z 2025-05-07T20:32:55.1179763Z if contiguous: 2025-05-07T20:32:55.1179997Z x0 = x0.contiguous() 2025-05-07T20:32:55.1180266Z x1 = x1.contiguous() 2025-05-07T20:32:55.1180520Z 2025-05-07T20:32:55.1180723Z if scale_ub is not None: 2025-05-07T20:32:55.1181124Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.1181473Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.1181785Z ) 2025-05-07T20:32:55.1181993Z else: 2025-05-07T20:32:55.1182217Z scale_ub_tensor = None 2025-05-07T20:32:55.1182479Z 2025-05-07T20:32:55.1182721Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.1183050Z op = silu_mul_quant 2025-05-07T20:32:55.1183314Z if compiled: 
2025-05-07T20:32:55.1183568Z op = torch.compile(op) 2025-05-07T20:32:55.1183875Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1184170Z 2025-05-07T20:32:55.1184372Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.1184709Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.1185031Z 2025-05-07T20:32:55.1185271Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.1185626Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.1185936Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.1186257Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.1186632Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.1186960Z 2025-05-07T20:32:55.1187269Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:55.1187475Z 2025-05-07T20:32:55.1187580Z moe/activation_test.py:126: 2025-05-07T20:32:55.1187889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1188240Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.1188571Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.1189370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.1190132Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.1190698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.1191438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.1192134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.1192872Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.1193623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.1194390Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.1195178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.1195821Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.1196427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.1196957Z fn() 2025-05-07T20:32:55.1197475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.1198058Z self.fn.run( 2025-05-07T20:32:55.1198532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.1199079Z kernel = self.compile( 2025-05-07T20:32:55.1199630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.1200285Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.1200691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1200933Z 2025-05-07T20:32:55.1201143Z self = 2025-05-07T20:32:55.1202295Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:55.1203680Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1591056430>} 2025-05-07T20:32:55.1205091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.1206116Z context = 2025-05-07T20:32:55.1206409Z 2025-05-07T20:32:55.1206587Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.1207129Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.1207606Z module_map=module_map) 2025-05-07T20:32:55.1207983Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.1208351Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.1208624Z E ^ 2025-05-07T20:32:55.1209140Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.1209667Z 2025-05-07T20:32:55.1210098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.1210612Z 2025-05-07T20:32:55.1210726Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1211143Z self=, 2025-05-07T20:32:55.1211558Z T=16384, 2025-05-07T20:32:55.1211764Z D=7168, 2025-05-07T20:32:55.1211962Z scale_ub=1200.0, 2025-05-07T20:32:55.1212199Z contiguous=False, 2025-05-07T20:32:55.1212488Z compiled=False, 2025-05-07T20:32:55.1212702Z ) 2025-05-07T20:32:56.4793592Z self = 2025-05-07T20:32:56.4794440Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:56.4794848Z 2025-05-07T20:32:56.4794986Z @given( 2025-05-07T20:32:56.4795304Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:56.4795635Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:56.4795958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:56.4796302Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:56.4796644Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:56.4796942Z ) 2025-05-07T20:32:56.4797304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:56.4797755Z def test_silu_mul_quant( 2025-05-07T20:32:56.4798007Z self, 2025-05-07T20:32:56.4798257Z T: int, 2025-05-07T20:32:56.4798466Z D: int, 2025-05-07T20:32:56.4798694Z scale_ub: Optional[float], 2025-05-07T20:32:56.4798979Z contiguous: bool, 2025-05-07T20:32:56.4799234Z compiled: bool, 2025-05-07T20:32:56.4799474Z ) -> None: 2025-05-07T20:32:56.4799695Z torch.manual_seed(2025) 2025-05-07T20:32:56.4799955Z 2025-05-07T20:32:56.4800239Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:56.4800587Z 2025-05-07T20:32:56.4800793Z x_sign = torch.sign(x) 2025-05-07T20:32:56.4801099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:56.4801416Z x = x_sign * x_clamp 2025-05-07T20:32:56.4801671Z x0 = x[:, :D] 2025-05-07T20:32:56.4801899Z x1 = x[:, D:] 2025-05-07T20:32:56.4802112Z 2025-05-07T20:32:56.4802311Z if contiguous: 2025-05-07T20:32:56.4802556Z x0 = x0.contiguous() 2025-05-07T20:32:56.4802823Z x1 = x1.contiguous() 2025-05-07T20:32:56.4803375Z 2025-05-07T20:32:56.4803582Z if scale_ub is not None: 2025-05-07T20:32:56.4803864Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:56.4804218Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:56.4804539Z ) 2025-05-07T20:32:56.4804744Z else: 2025-05-07T20:32:56.4804969Z scale_ub_tensor = None 2025-05-07T20:32:56.4805233Z 2025-05-07T20:32:56.4805479Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:32:56.4805800Z op = silu_mul_quant 2025-05-07T20:32:56.4806067Z if compiled: 2025-05-07T20:32:56.4806331Z op = torch.compile(op) 2025-05-07T20:32:56.4806632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.4806924Z 2025-05-07T20:32:56.4807128Z > y_fp8, y_scale = fn() 2025-05-07T20:32:56.4807298Z 2025-05-07T20:32:56.4807403Z moe/activation_test.py:117: 2025-05-07T20:32:56.4807718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.4808065Z moe/activation_test.py:115: in fn 2025-05-07T20:32:56.4808356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.4809057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:56.4809933Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:56.4810491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:56.4811178Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:56.4811850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:56.4812393Z kernel = self.compile( 2025-05-07T20:32:56.4812947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:56.4813694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:56.4814100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.4814335Z 2025-05-07T20:32:56.4814551Z self = 2025-05-07T20:32:56.4815636Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:56.4817020Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1590ffe9d0>} 2025-05-07T20:32:56.4818358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:56.4819382Z context = 2025-05-07T20:32:56.4819671Z 2025-05-07T20:32:56.4819852Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:56.4820381Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:56.4820857Z module_map=module_map) 2025-05-07T20:32:56.4821351Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:56.4821708Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:56.4821985Z E ^ 2025-05-07T20:32:56.4822457Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.4822911Z 2025-05-07T20:32:56.4823334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:56.4823856Z 2025-05-07T20:32:56.4824017Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:56.4824441Z self=, 2025-05-07T20:32:56.4824856Z T=1, 2025-05-07T20:32:56.4825053Z D=7168, 2025-05-07T20:32:56.4825253Z scale_ub=None, 2025-05-07T20:32:56.4825482Z contiguous=True, 2025-05-07T20:32:56.4825721Z compiled=True, 2025-05-07T20:32:56.4825933Z ) 2025-05-07T20:32:56.4826266Z self = 2025-05-07T20:32:56.4826759Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:56.4827020Z 2025-05-07T20:32:56.4827104Z @given( 2025-05-07T20:32:56.4827346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:56.4827670Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:56.4828001Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:56.4835491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:56.4835864Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:56.4836164Z ) 2025-05-07T20:32:56.4836540Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:56.4837000Z def test_silu_mul_quant( 2025-05-07T20:32:56.4837248Z self, 2025-05-07T20:32:56.4837537Z T: int, 2025-05-07T20:32:56.4837793Z D: int, 2025-05-07T20:32:56.4838031Z scale_ub: Optional[float], 2025-05-07T20:32:56.4838308Z contiguous: bool, 2025-05-07T20:32:56.4838562Z compiled: bool, 2025-05-07T20:32:56.4838801Z ) -> None: 2025-05-07T20:32:56.4839023Z torch.manual_seed(2025) 2025-05-07T20:32:56.4839282Z 2025-05-07T20:32:56.4839567Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:56.4839913Z 2025-05-07T20:32:56.4840429Z x_sign = torch.sign(x) 2025-05-07T20:32:56.4840745Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:56.4841072Z x = x_sign * x_clamp 2025-05-07T20:32:56.4841417Z x0 = x[:, :D] 2025-05-07T20:32:56.4841648Z x1 = x[:, D:] 2025-05-07T20:32:56.4841858Z 2025-05-07T20:32:56.4842058Z if contiguous: 2025-05-07T20:32:56.4842301Z x0 = x0.contiguous() 2025-05-07T20:32:56.4842563Z x1 = x1.contiguous() 2025-05-07T20:32:56.4842821Z 2025-05-07T20:32:56.4843027Z if scale_ub is not None: 2025-05-07T20:32:56.4843308Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:56.4843659Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:56.4843986Z ) 2025-05-07T20:32:56.4844194Z else: 2025-05-07T20:32:56.4844415Z scale_ub_tensor = None 2025-05-07T20:32:56.4844682Z 2025-05-07T20:32:56.4844928Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:56.4845246Z op = silu_mul_quant 2025-05-07T20:32:56.4845508Z if compiled: 2025-05-07T20:32:56.4845771Z op = torch.compile(op) 2025-05-07T20:32:56.4846080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.4846369Z 2025-05-07T20:32:56.4846574Z y_fp8, y_scale = fn() 2025-05-07T20:32:56.4846862Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:56.4847166Z 2025-05-07T20:32:56.4847419Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:56.4847761Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:56.4848067Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:56.4848395Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:56.4848765Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:56.4849083Z 2025-05-07T20:32:56.4849300Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:56.4849502Z 2025-05-07T20:32:56.4849615Z moe/activation_test.py:126: 2025-05-07T20:32:56.4849925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.4850354Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:56.4850703Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:56.4851507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:56.4852276Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:56.4852839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:56.4853537Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:56.4854227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:56.4854964Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:56.4855730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:56.4856491Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:56.4857217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:56.4857993Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:56.4858612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:56.4859138Z fn() 2025-05-07T20:32:56.4859651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:56.4860237Z self.fn.run( 2025-05-07T20:32:56.4860717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:56.4861317Z kernel = self.compile( 2025-05-07T20:32:56.4861875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:56.4862589Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:56.4863003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.4863236Z 2025-05-07T20:32:56.4863455Z self = 2025-05-07T20:32:56.4864538Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:56.4865936Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1590ffef70>} 2025-05-07T20:32:56.4867280Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:56.4868308Z context = 2025-05-07T20:32:56.4868606Z 2025-05-07T20:32:56.4868777Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:56.4869323Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:56.4869796Z module_map=module_map) 2025-05-07T20:32:56.4870168Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:56.4870537Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:56.4870815Z E ^ 2025-05-07T20:32:56.4871275Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.4871730Z 2025-05-07T20:32:56.4872197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:56.4872732Z 2025-05-07T20:32:56.4872837Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:56.4873259Z self=, 2025-05-07T20:32:56.4873662Z T=4096, 2025-05-07T20:32:56.4873857Z D=5120, 2025-05-07T20:32:56.4874064Z scale_ub=None, 2025-05-07T20:32:56.4874282Z contiguous=False, 2025-05-07T20:32:56.4874516Z compiled=False, 2025-05-07T20:32:56.4874730Z ) 2025-05-07T20:32:58.2388059Z self = 2025-05-07T20:32:58.2388877Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.2389280Z 2025-05-07T20:32:58.2389395Z @given( 2025-05-07T20:32:58.2389722Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2390048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2390375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2390752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2391094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2391396Z ) 2025-05-07T20:32:58.2391765Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2392654Z def test_silu_mul_quant( 2025-05-07T20:32:58.2392919Z self, 2025-05-07T20:32:58.2393133Z T: int, 2025-05-07T20:32:58.2393338Z D: int, 2025-05-07T20:32:58.2393574Z scale_ub: Optional[float], 2025-05-07T20:32:58.2393862Z contiguous: bool, 2025-05-07T20:32:58.2394112Z compiled: bool, 2025-05-07T20:32:58.2394355Z ) -> None: 2025-05-07T20:32:58.2394586Z torch.manual_seed(2025) 2025-05-07T20:32:58.2394836Z 2025-05-07T20:32:58.2395122Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2395503Z 2025-05-07T20:32:58.2395727Z x_sign = torch.sign(x) 2025-05-07T20:32:58.2396134Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.2396463Z x = x_sign * x_clamp 2025-05-07T20:32:58.2396719Z x0 = x[:, :D] 2025-05-07T20:32:58.2396941Z x1 = x[:, D:] 2025-05-07T20:32:58.2397159Z 2025-05-07T20:32:58.2397360Z if contiguous: 2025-05-07T20:32:58.2397604Z x0 = x0.contiguous() 2025-05-07T20:32:58.2397877Z x1 = x1.contiguous() 2025-05-07T20:32:58.2398131Z 2025-05-07T20:32:58.2398328Z if scale_ub is not None: 2025-05-07T20:32:58.2398612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.2398961Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.2399277Z ) 2025-05-07T20:32:58.2399482Z else: 2025-05-07T20:32:58.2399705Z scale_ub_tensor = None 2025-05-07T20:32:58.2399963Z 2025-05-07T20:32:58.2400209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.2400539Z op = silu_mul_quant 2025-05-07T20:32:58.2400804Z if compiled: 
2025-05-07T20:32:58.2401064Z op = torch.compile(op) 2025-05-07T20:32:58.2401376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2401667Z 2025-05-07T20:32:58.2401864Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.2402040Z 2025-05-07T20:32:58.2402151Z moe/activation_test.py:117: 2025-05-07T20:32:58.2402459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2402798Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.2403091Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2403792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.2404484Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.2405034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.2405882Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.2406565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.2407102Z kernel = self.compile( 2025-05-07T20:32:58.2407656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.2408322Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.2408731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2408969Z 2025-05-07T20:32:58.2409183Z self = 2025-05-07T20:32:58.2410269Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.2411685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1590c1fca0>} 2025-05-07T20:32:58.2413077Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.2414133Z context = 2025-05-07T20:32:58.2414429Z 2025-05-07T20:32:58.2414604Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.2415138Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.2415613Z module_map=module_map) 2025-05-07T20:32:58.2415986Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.2416354Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.2416673Z E ^ 2025-05-07T20:32:58.2417136Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.2417600Z 2025-05-07T20:32:58.2418028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.2418550Z 2025-05-07T20:32:58.2418659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2419085Z self=, 2025-05-07T20:32:58.2419489Z T=4096, 2025-05-07T20:32:58.2419687Z D=7168, 2025-05-07T20:32:58.2419891Z scale_ub=None, 2025-05-07T20:32:58.2420111Z contiguous=False, 2025-05-07T20:32:58.2420345Z compiled=False, 2025-05-07T20:32:58.2420565Z ) 2025-05-07T20:32:58.2420884Z self = 2025-05-07T20:32:58.2421516Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.2421798Z 2025-05-07T20:32:58.2421879Z @given( 2025-05-07T20:32:58.2422114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2422431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2422745Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2423090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2423441Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2423741Z ) 2025-05-07T20:32:58.2424103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2424549Z def test_silu_mul_quant( 2025-05-07T20:32:58.2424800Z self, 2025-05-07T20:32:58.2425005Z T: int, 2025-05-07T20:32:58.2425205Z D: int, 2025-05-07T20:32:58.2425457Z scale_ub: Optional[float], 2025-05-07T20:32:58.2425762Z contiguous: bool, 2025-05-07T20:32:58.2426003Z compiled: bool, 2025-05-07T20:32:58.2426295Z ) -> None: 2025-05-07T20:32:58.2426523Z torch.manual_seed(2025) 2025-05-07T20:32:58.2426768Z 2025-05-07T20:32:58.2427047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2427400Z 2025-05-07T20:32:58.2427601Z x_sign = torch.sign(x) 2025-05-07T20:32:58.2427909Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.2428234Z x = x_sign * x_clamp 2025-05-07T20:32:58.2428478Z x0 = x[:, :D] 2025-05-07T20:32:58.2428703Z x1 = x[:, D:] 2025-05-07T20:32:58.2428920Z 2025-05-07T20:32:58.2429118Z if contiguous: 2025-05-07T20:32:58.2429353Z x0 = x0.contiguous() 2025-05-07T20:32:58.2429620Z x1 = x1.contiguous() 2025-05-07T20:32:58.2429872Z 2025-05-07T20:32:58.2430067Z if scale_ub is not None: 2025-05-07T20:32:58.2430352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.2430700Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.2431024Z ) 2025-05-07T20:32:58.2431229Z else: 2025-05-07T20:32:58.2431450Z scale_ub_tensor = None 2025-05-07T20:32:58.2431704Z 2025-05-07T20:32:58.2431944Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.2432265Z op = silu_mul_quant 2025-05-07T20:32:58.2432616Z if compiled: 2025-05-07T20:32:58.2432878Z op = torch.compile(op) 2025-05-07T20:32:58.2433183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2433462Z 2025-05-07T20:32:58.2433666Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.2433842Z 2025-05-07T20:32:58.2433945Z moe/activation_test.py:117: 2025-05-07T20:32:58.2434252Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2434591Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.2434884Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2435588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.2436334Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.2436880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.2437574Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.2438250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.2438785Z kernel = self.compile( 2025-05-07T20:32:58.2439333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.2439998Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.2440728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2440975Z 2025-05-07T20:32:58.2441193Z self = 2025-05-07T20:32:58.2442278Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.2443650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1590c7a700>} 2025-05-07T20:32:58.2444994Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.2446008Z context = 2025-05-07T20:32:58.2446306Z 2025-05-07T20:32:58.2446477Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.2447089Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.2447574Z module_map=module_map) 2025-05-07T20:32:58.2447948Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.2448313Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.2448604Z E ^ 2025-05-07T20:32:58.2449066Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.2449520Z 2025-05-07T20:32:58.2449947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.2450467Z 2025-05-07T20:32:58.2450574Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2450997Z self=, 2025-05-07T20:32:58.2451399Z T=128, 2025-05-07T20:32:58.2451601Z D=7168, 2025-05-07T20:32:58.2451809Z scale_ub=None, 2025-05-07T20:32:58.2452027Z contiguous=False, 2025-05-07T20:32:58.2452266Z compiled=True, 2025-05-07T20:32:58.2452478Z ) 2025-05-07T20:32:58.3209831Z self = 2025-05-07T20:32:58.3210893Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:58.3211343Z 2025-05-07T20:32:58.3211431Z @given( 2025-05-07T20:32:58.3211679Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3212006Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3212331Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3212671Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3213020Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3213318Z ) 2025-05-07T20:32:58.3213673Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3214132Z def test_silu_mul_quant( 2025-05-07T20:32:58.3214523Z self, 2025-05-07T20:32:58.3214725Z T: int, 2025-05-07T20:32:58.3214938Z D: int, 2025-05-07T20:32:58.3215172Z scale_ub: Optional[float], 2025-05-07T20:32:58.3215452Z contiguous: bool, 2025-05-07T20:32:58.3215706Z compiled: bool, 2025-05-07T20:32:58.3215952Z ) -> None: 2025-05-07T20:32:58.3216174Z torch.manual_seed(2025) 2025-05-07T20:32:58.3216434Z 2025-05-07T20:32:58.3216720Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3217073Z 2025-05-07T20:32:58.3217272Z x_sign = torch.sign(x) 2025-05-07T20:32:58.3217577Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.3217900Z x = x_sign * x_clamp 2025-05-07T20:32:58.3218147Z x0 = x[:, :D] 2025-05-07T20:32:58.3218378Z x1 = x[:, D:] 2025-05-07T20:32:58.3218602Z 2025-05-07T20:32:58.3218794Z if contiguous: 2025-05-07T20:32:58.3219049Z x0 = x0.contiguous() 2025-05-07T20:32:58.3219323Z x1 = x1.contiguous() 2025-05-07T20:32:58.3219571Z 2025-05-07T20:32:58.3219778Z if scale_ub is not None: 2025-05-07T20:32:58.3220066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.3220414Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.3220744Z ) 2025-05-07T20:32:58.3220952Z else: 2025-05-07T20:32:58.3221277Z scale_ub_tensor = None 2025-05-07T20:32:58.3221546Z 2025-05-07T20:32:58.3221794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.3222125Z op = silu_mul_quant 2025-05-07T20:32:58.3222385Z if compiled: 2025-05-07T20:32:58.3222645Z op = torch.compile(op) 2025-05-07T20:32:58.3222961Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.3223243Z 2025-05-07T20:32:58.3223451Z y_fp8, y_scale = fn() 2025-05-07T20:32:58.3223837Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:58.3224142Z 2025-05-07T20:32:58.3224394Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.3224746Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:58.3225048Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:58.3225379Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:58.3225750Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.3226072Z 2025-05-07T20:32:58.3226278Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:58.3226485Z 2025-05-07T20:32:58.3226590Z moe/activation_test.py:126: 2025-05-07T20:32:58.3226902Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.3227242Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:58.3227586Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.3228395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:58.3229154Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:58.3229720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.3230517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.3231228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:58.3231953Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.3232716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:58.3233470Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.3234205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:58.3234897Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:58.3235517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:58.3236056Z fn() 2025-05-07T20:32:58.3236569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:58.3237157Z self.fn.run( 2025-05-07T20:32:58.3237634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.3238172Z kernel = self.compile( 2025-05-07T20:32:58.3238716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.3239377Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.3239789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.3240030Z 2025-05-07T20:32:58.3240532Z self = 2025-05-07T20:32:58.3241621Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.3243011Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1590a2f5e0>} 2025-05-07T20:32:58.3244355Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.3245390Z context = 2025-05-07T20:32:58.3245688Z 2025-05-07T20:32:58.3245936Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.3246483Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.3246964Z module_map=module_map) 2025-05-07T20:32:58.3247350Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.3247716Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:58.3247999Z E ^ 2025-05-07T20:32:58.3248478Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.3248932Z 2025-05-07T20:32:58.3249358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.3249881Z 2025-05-07T20:32:58.3249990Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3250421Z self=, 2025-05-07T20:32:58.3250841Z T=128, 2025-05-07T20:32:58.3251038Z D=7168, 2025-05-07T20:32:58.3251244Z scale_ub=None, 2025-05-07T20:32:58.3251474Z contiguous=False, 2025-05-07T20:32:58.3251709Z compiled=False, 2025-05-07T20:32:58.3251927Z ) 2025-05-07T20:32:58.7325873Z self = 2025-05-07T20:32:58.7326764Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.7327044Z 2025-05-07T20:32:58.7327137Z @given( 2025-05-07T20:32:58.7327383Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.7327709Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.7328029Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.7328375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.7328710Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.7329011Z ) 2025-05-07T20:32:58.7329376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.7337576Z def test_silu_mul_quant( 2025-05-07T20:32:58.7337860Z self, 2025-05-07T20:32:58.7338076Z T: int, 2025-05-07T20:32:58.7338284Z D: int, 2025-05-07T20:32:58.7338520Z scale_ub: Optional[float], 2025-05-07T20:32:58.7338826Z contiguous: bool, 2025-05-07T20:32:58.7339075Z compiled: bool, 2025-05-07T20:32:58.7339329Z ) -> None: 2025-05-07T20:32:58.7339564Z torch.manual_seed(2025) 2025-05-07T20:32:58.7339814Z 2025-05-07T20:32:58.7340377Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.7340744Z 2025-05-07T20:32:58.7340945Z x_sign = torch.sign(x) 2025-05-07T20:32:58.7341354Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.7341679Z x = x_sign * x_clamp 2025-05-07T20:32:58.7341927Z x0 = x[:, :D] 2025-05-07T20:32:58.7342160Z x1 = x[:, D:] 2025-05-07T20:32:58.7342392Z 2025-05-07T20:32:58.7342585Z if contiguous: 2025-05-07T20:32:58.7342836Z x0 = x0.contiguous() 2025-05-07T20:32:58.7343110Z x1 = x1.contiguous() 2025-05-07T20:32:58.7343365Z 2025-05-07T20:32:58.7343569Z if scale_ub is not None: 2025-05-07T20:32:58.7343859Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.7344214Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.7344530Z ) 2025-05-07T20:32:58.7344738Z else: 2025-05-07T20:32:58.7344962Z scale_ub_tensor = None 2025-05-07T20:32:58.7345221Z 2025-05-07T20:32:58.7345475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.7345855Z op = silu_mul_quant 2025-05-07T20:32:58.7346116Z if compiled: 
2025-05-07T20:32:58.7346383Z op = torch.compile(op) 2025-05-07T20:32:58.7346696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.7346980Z 2025-05-07T20:32:58.7347328Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.7347502Z 2025-05-07T20:32:58.7347619Z moe/activation_test.py:117: 2025-05-07T20:32:58.7347931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.7348269Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.7348573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.7349289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.7349984Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.7350537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.7351227Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.7351900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.7352442Z kernel = self.compile( 2025-05-07T20:32:58.7353005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.7353671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.7354143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.7354444Z 2025-05-07T20:32:58.7354657Z self = 2025-05-07T20:32:58.7355750Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.7357141Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15906d2ee0>} 2025-05-07T20:32:58.7358567Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.7359578Z context = 2025-05-07T20:32:58.7359878Z 2025-05-07T20:32:58.7360053Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.7360595Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.7361076Z module_map=module_map) 2025-05-07T20:32:58.7361451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.7361817Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.7362090Z E ^ 2025-05-07T20:32:58.7362554Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.7363024Z 2025-05-07T20:32:58.7363445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.7363965Z 2025-05-07T20:32:58.7364073Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.7364496Z self=, 2025-05-07T20:32:58.7364905Z T=4096, 2025-05-07T20:32:58.7365109Z D=5120, 2025-05-07T20:32:58.7365314Z scale_ub=1200.0, 2025-05-07T20:32:58.7365546Z contiguous=True, 2025-05-07T20:32:58.7365809Z compiled=False, 2025-05-07T20:32:58.7366048Z ) 2025-05-07T20:32:58.7366369Z self = 2025-05-07T20:32:58.7366875Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.7367151Z 2025-05-07T20:32:58.7367243Z @given( 2025-05-07T20:32:58.7367488Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.7367857Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.7368186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.7368536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.7368876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.7369181Z ) 2025-05-07T20:32:58.7369559Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.7370006Z def test_silu_mul_quant( 2025-05-07T20:32:58.7370261Z self, 2025-05-07T20:32:58.7370470Z T: int, 2025-05-07T20:32:58.7370678Z D: int, 2025-05-07T20:32:58.7370909Z scale_ub: Optional[float], 2025-05-07T20:32:58.7371194Z contiguous: bool, 2025-05-07T20:32:58.7371436Z compiled: bool, 2025-05-07T20:32:58.7371676Z ) -> None: 2025-05-07T20:32:58.7371906Z torch.manual_seed(2025) 2025-05-07T20:32:58.7372152Z 2025-05-07T20:32:58.7372437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.7372804Z 2025-05-07T20:32:58.7373007Z x_sign = torch.sign(x) 2025-05-07T20:32:58.7373302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.7373628Z x = x_sign * x_clamp 2025-05-07T20:32:58.7373888Z x0 = x[:, :D] 2025-05-07T20:32:58.7374111Z x1 = x[:, D:] 2025-05-07T20:32:58.7374425Z 2025-05-07T20:32:58.7374627Z if contiguous: 2025-05-07T20:32:58.7374866Z x0 = x0.contiguous() 2025-05-07T20:32:58.7375133Z x1 = x1.contiguous() 2025-05-07T20:32:58.7375379Z 2025-05-07T20:32:58.7375597Z if scale_ub is not None: 2025-05-07T20:32:58.7375911Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.7376257Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.7376575Z ) 2025-05-07T20:32:58.7376780Z else: 2025-05-07T20:32:58.7377005Z scale_ub_tensor = None 2025-05-07T20:32:58.7377263Z 2025-05-07T20:32:58.7377510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.7377889Z op = silu_mul_quant 2025-05-07T20:32:58.7378151Z if compiled: 2025-05-07T20:32:58.7378405Z op = torch.compile(op) 2025-05-07T20:32:58.7378714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.7378995Z 2025-05-07T20:32:58.7379200Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.7379368Z 2025-05-07T20:32:58.7379476Z moe/activation_test.py:117: 2025-05-07T20:32:58.7379773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.7380114Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.7380405Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.7381193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.7381881Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.7382427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.7383129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.7383788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.7384343Z kernel = self.compile( 2025-05-07T20:32:58.7384892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.7385559Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.7385966Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.7386206Z 2025-05-07T20:32:58.7386420Z self = 2025-05-07T20:32:58.7387549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.7388927Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15907a9670>} 2025-05-07T20:32:58.7390259Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.7391289Z context = 2025-05-07T20:32:58.7391585Z 2025-05-07T20:32:58.7391758Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.7392288Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.7392753Z module_map=module_map) 2025-05-07T20:32:58.7393136Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.7393500Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.7393770Z E ^ 2025-05-07T20:32:58.7394229Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.7394742Z 2025-05-07T20:32:58.7395204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.7395724Z 2025-05-07T20:32:58.7395842Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.7396256Z self=, 2025-05-07T20:32:58.7396665Z T=1, 2025-05-07T20:32:58.7396861Z D=5120, 2025-05-07T20:32:58.7397062Z scale_ub=None, 2025-05-07T20:32:58.7397278Z contiguous=True, 2025-05-07T20:32:58.7397509Z compiled=True, 2025-05-07T20:32:58.7397720Z ) 2025-05-07T20:32:59.3873700Z self = 2025-05-07T20:32:59.3874752Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:59.3875105Z 2025-05-07T20:32:59.3875217Z @given( 2025-05-07T20:32:59.3875522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.3875956Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.3876292Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.3876638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.3876987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.3877286Z ) 2025-05-07T20:32:59.3877643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.3878099Z def test_silu_mul_quant( 2025-05-07T20:32:59.3878355Z self, 2025-05-07T20:32:59.3878558Z T: int, 2025-05-07T20:32:59.3878769Z D: int, 2025-05-07T20:32:59.3878999Z scale_ub: Optional[float], 2025-05-07T20:32:59.3879291Z contiguous: bool, 2025-05-07T20:32:59.3879537Z compiled: bool, 2025-05-07T20:32:59.3879783Z ) -> None: 2025-05-07T20:32:59.3880010Z torch.manual_seed(2025) 2025-05-07T20:32:59.3880257Z 2025-05-07T20:32:59.3880538Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.3880898Z 2025-05-07T20:32:59.3881098Z x_sign = torch.sign(x) 2025-05-07T20:32:59.3881399Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.3881720Z x = x_sign * x_clamp 2025-05-07T20:32:59.3881968Z x0 = x[:, :D] 2025-05-07T20:32:59.3882195Z x1 = x[:, D:] 2025-05-07T20:32:59.3882411Z 2025-05-07T20:32:59.3882603Z if contiguous: 2025-05-07T20:32:59.3882849Z x0 = x0.contiguous() 2025-05-07T20:32:59.3883121Z x1 = x1.contiguous() 2025-05-07T20:32:59.3883392Z 2025-05-07T20:32:59.3883596Z if scale_ub is not None: 2025-05-07T20:32:59.3883881Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.3884364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.3884689Z ) 2025-05-07T20:32:59.3884894Z else: 2025-05-07T20:32:59.3885110Z scale_ub_tensor = None 2025-05-07T20:32:59.3885373Z 2025-05-07T20:32:59.3885618Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.3885942Z op = silu_mul_quant 2025-05-07T20:32:59.3886206Z if compiled: 2025-05-07T20:32:59.3886465Z op = torch.compile(op) 2025-05-07T20:32:59.3886777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.3887056Z 2025-05-07T20:32:59.3887262Z y_fp8, y_scale = fn() 2025-05-07T20:32:59.3887561Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:59.3887859Z 2025-05-07T20:32:59.3888108Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.3888453Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:59.3888763Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:59.3889090Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:59.3889465Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.3889778Z 2025-05-07T20:32:59.3889994Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:59.3890423Z 2025-05-07T20:32:59.3890533Z moe/activation_test.py:126: 2025-05-07T20:32:59.3890843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3891183Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:59.3891526Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.3892325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:59.3893082Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:59.3893642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.3894389Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.3895088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:59.3895849Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.3896625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:59.3897382Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.3898112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:59.3898751Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:59.3899368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:59.3899898Z fn() 2025-05-07T20:32:59.3900407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:59.3901134Z self.fn.run( 2025-05-07T20:32:59.3901612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.3902153Z kernel = self.compile( 2025-05-07T20:32:59.3902697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.3903367Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.3903778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3904012Z 2025-05-07T20:32:59.3904224Z self = 2025-05-07T20:32:59.3905367Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.3906820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f15903f6550>} 2025-05-07T20:32:59.3908178Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.3909202Z context = 2025-05-07T20:32:59.3909497Z 2025-05-07T20:32:59.3909670Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.3910207Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.3910696Z module_map=module_map) 2025-05-07T20:32:59.3911074Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.3911439Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:59.3911723Z E ^ 2025-05-07T20:32:59.3912244Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.3912741Z 2025-05-07T20:32:59.3913160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.3913688Z 2025-05-07T20:32:59.3913797Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.3914219Z self=, 2025-05-07T20:32:59.3914631Z T=2048, 2025-05-07T20:32:59.3914826Z D=5120, 2025-05-07T20:32:59.3915031Z scale_ub=None, 2025-05-07T20:32:59.3915257Z contiguous=True, 2025-05-07T20:32:59.3915484Z compiled=True, 2025-05-07T20:32:59.3915754Z ) 2025-05-07T20:33:00.0038868Z self = 2025-05-07T20:33:00.0040752Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.0041315Z 2025-05-07T20:33:00.0041494Z @given( 2025-05-07T20:33:00.0042015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.0042653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.0043291Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.0043981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.0044648Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.0045233Z ) 2025-05-07T20:33:00.0045919Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.0046427Z def test_silu_mul_quant( 2025-05-07T20:33:00.0046682Z self, 2025-05-07T20:33:00.0046888Z T: int, 2025-05-07T20:33:00.0047099Z D: int, 2025-05-07T20:33:00.0047339Z scale_ub: Optional[float], 2025-05-07T20:33:00.0047628Z contiguous: bool, 2025-05-07T20:33:00.0047875Z compiled: bool, 2025-05-07T20:33:00.0048117Z ) -> None: 2025-05-07T20:33:00.0048348Z torch.manual_seed(2025) 2025-05-07T20:33:00.0048604Z 2025-05-07T20:33:00.0048893Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.0049252Z 2025-05-07T20:33:00.0049456Z x_sign = torch.sign(x) 2025-05-07T20:33:00.0049752Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.0050078Z x = x_sign * x_clamp 2025-05-07T20:33:00.0050333Z x0 = x[:, :D] 2025-05-07T20:33:00.0050560Z x1 = x[:, D:] 2025-05-07T20:33:00.0050778Z 2025-05-07T20:33:00.0050974Z if contiguous: 2025-05-07T20:33:00.0051213Z x0 = x0.contiguous() 2025-05-07T20:33:00.0051490Z x1 = x1.contiguous() 2025-05-07T20:33:00.0051745Z 2025-05-07T20:33:00.0052260Z if scale_ub is not None: 2025-05-07T20:33:00.0052554Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.0052911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.0053229Z ) 2025-05-07T20:33:00.0053436Z else: 2025-05-07T20:33:00.0053665Z scale_ub_tensor = None 2025-05-07T20:33:00.0053926Z 2025-05-07T20:33:00.0054178Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0054511Z op = silu_mul_quant 2025-05-07T20:33:00.0054779Z if compiled: 
2025-05-07T20:33:00.0055038Z op = torch.compile(op) 2025-05-07T20:33:00.0055350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0055643Z 2025-05-07T20:33:00.0055839Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.0056146Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.0056452Z 2025-05-07T20:33:00.0056694Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0057048Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.0057355Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.0057675Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.0058051Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.0058546Z 2025-05-07T20:33:00.0058764Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:00.0058965Z 2025-05-07T20:33:00.0059072Z moe/activation_test.py:126: 2025-05-07T20:33:00.0059380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0059731Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.0060070Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.0060873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.0061714Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.0062356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.0063046Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.0063746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.0064482Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.0065236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:00.0066003Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.0066741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.0067390Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.0068000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.0068529Z fn() 2025-05-07T20:33:00.0069056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.0069652Z self.fn.run( 2025-05-07T20:33:00.0070128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.0070669Z kernel = self.compile( 2025-05-07T20:33:00.0071229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.0071887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.0072298Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0072541Z 2025-05-07T20:33:00.0072802Z self = 2025-05-07T20:33:00.0073894Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:33:00.0075310Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1590015f70>} 2025-05-07T20:33:00.0076676Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.0077701Z context = 2025-05-07T20:33:00.0077994Z 2025-05-07T20:33:00.0078179Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.0078720Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.0079198Z module_map=module_map) 2025-05-07T20:33:00.0079580Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.0079950Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:00.0080315Z E ^ 2025-05-07T20:33:00.0080789Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.0081243Z 2025-05-07T20:33:00.0081671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True): fails in ref_fn (moe/activation_test.py:126, via triton_quantize_fp8_row) with the identical CompilationError in _kernel_quantize_fp8_row; the repeated test source and traceback match the trace above.
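All of these failures share one root cause: Triton refuses to lower fp8e4nv (FP8 E4M3) on this GPU. The job runs on linux.g5.4xlarge, whose NVIDIA A10G reports compute capability 8.6, and Triton's fp8e4nv codegen is only available on SM 8.9+ parts (L4/L40S/H100 class); on this device only fp8e4b15 and fp8e5 are usable, exactly as the ValueError says. A minimal sketch of a capability guard, assuming the SM 8.9 threshold implied by the error message (the helper name is hypothetical, not FBGEMM API):

import torch

def cuda_supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton only emits fp8e4nv (E4M3) on SM 8.9+.
    # The A10G on this runner is SM 8.6, hence the CompilationError in
    # _kernel_quantize_fp8_row above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)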
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True): same CompilationError in _kernel_quantize_fp8_row, again raised from ref_fn.

Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)

2025-05-07T20:33:01.8975028Z W0507 20:33:01.895821 88433 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:01.8977203Z W0507 20:33:01.895821 88433 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:01.8978530Z W0507 20:33:01.895821 88433 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:01.8979526Z W0507 20:33:01.895821 88433 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:01.8980642Z W0507 20:33:01.895821 88433 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
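The recompile-limit warning is a separate issue caused by the test sweeping contiguous over both True and False: x0 = x[:, :D] is a view that keeps the parent row stride 2*D (10240 when D=5120), while x0.contiguous() reallocates with row stride D (5120). torch.compile guards on strides, so every layout flip recompiles silu_mul_quant until config.recompile_limit (8) is reached. A short sketch that reproduces the stride mismatch quoted in the warning (plain PyTorch, no GPU or FBGEMM required):

import torch

D = 5120
x = torch.randn([128, 2 * D], dtype=torch.bfloat16)

x0_view = x[:, :D]              # slice of x: keeps the parent row stride 2*D
x0_cont = x0_view.contiguous()  # fresh buffer: row stride becomes D

print(x0_view.stride())  # (10240, 1): the "actual 10240" from the warning
print(x0_cont.stride())  # (5120, 1):  the "expected 5120"

As the warning itself notes, TORCH_LOGS="recompiles" lists every recompilation reason; raising torch._dynamo.config.recompile_limit would also quiet it, but neither addresses the fp8e4nv failure.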
The T=16384 example then fails the same way: CompilationError in _kernel_quantize_fp8_row, raised from ref_fn at moe/activation_test.py:126.

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): this example fails at `> y_fp8, y_scale = fn()` (moe/activation_test.py:117), inside the op under test itself:

moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True): here execution reaches ref_fn, and the same CompilationError is raised from the reference path again (_kernel_quantize_fp8_row via triton_quantize_fp8_row).
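For context on what the test checks: silu_mul_quant fuses y = x0 * sigmoid(x0) * x1 with row-wise FP8 quantization, and ref_fn recomputes the same value unfused before calling triton_quantize_fp8_row. A rough eager-mode illustration of row-wise FP8 quantization, assuming the conventional per-row scheme (scale = row amax / FP8 max, optionally capped by scale_ub); this is a sketch for intuition only, not FBGEMM's kernel, and FP8_E4M3_MAX is an assumed constant:

import torch

FP8_E4M3_MAX = 448.0  # assumed: largest finite magnitude representable in e4m3

def rowwise_fp8_sketch(y: torch.Tensor, scale_ub: torch.Tensor = None):
    # One scale per row, chosen so the row's largest |value| maps to FP8 max.
    row_amax = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_amax = torch.minimum(row_amax, scale_ub)
    y_scale = row_amax / FP8_E4M3_MAX
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

Under these assumptions the round trip y_fp8.to(torch.float32) * y_scale[:, None] approximately recovers y, which is exactly the dequantization the test performs after calling fn().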
= torch.compile(op) 2025-05-07T20:33:02.6409073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.6409354Z 2025-05-07T20:33:02.6409560Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.6409731Z 2025-05-07T20:33:02.6409844Z moe/activation_test.py:117: 2025-05-07T20:33:02.6410152Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.6410500Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.6410797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.6411496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.6412318Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.6412879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.6413570Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.6414233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.6414784Z kernel = self.compile( 2025-05-07T20:33:02.6415336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.6416002Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.6416470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.6416713Z 2025-05-07T20:33:02.6416927Z self = 2025-05-07T20:33:02.6418013Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.6419385Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158fdb0790>} 2025-05-07T20:33:02.6420719Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.6421843Z context = 2025-05-07T20:33:02.6422141Z 2025-05-07T20:33:02.6422314Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.6422851Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.6423324Z module_map=module_map) 2025-05-07T20:33:02.6423709Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.6424080Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.6424355Z E ^ 2025-05-07T20:33:02.6424818Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.6425273Z 2025-05-07T20:33:02.6425690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.6426211Z 2025-05-07T20:33:02.6426376Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.6426795Z self=, 2025-05-07T20:33:02.6427211Z T=128, 2025-05-07T20:33:02.6427411Z D=5120, 2025-05-07T20:33:02.6427614Z scale_ub=None, 2025-05-07T20:33:02.6427834Z contiguous=False, 2025-05-07T20:33:02.6428073Z compiled=True, 2025-05-07T20:33:02.6428291Z ) 2025-05-07T20:33:02.6428623Z self = 2025-05-07T20:33:02.6429125Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:02.6429397Z 2025-05-07T20:33:02.6429484Z @given( 2025-05-07T20:33:02.6429721Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.6430045Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.6430362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.6430699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.6431053Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.6431350Z ) 2025-05-07T20:33:02.6431702Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.6432154Z def test_silu_mul_quant( 2025-05-07T20:33:02.6432405Z self, 2025-05-07T20:33:02.6432603Z T: int, 2025-05-07T20:33:02.6432927Z D: int, 2025-05-07T20:33:02.6433156Z scale_ub: Optional[float], 2025-05-07T20:33:02.6433432Z contiguous: bool, 2025-05-07T20:33:02.6433682Z compiled: bool, 2025-05-07T20:33:02.6433916Z ) -> None: 2025-05-07T20:33:02.6434145Z torch.manual_seed(2025) 2025-05-07T20:33:02.6434395Z 2025-05-07T20:33:02.6434671Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.6435021Z 2025-05-07T20:33:02.6435215Z x_sign = torch.sign(x) 2025-05-07T20:33:02.6435514Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.6435837Z x = x_sign * x_clamp 2025-05-07T20:33:02.6436129Z x0 = x[:, :D] 2025-05-07T20:33:02.6436355Z x1 = x[:, D:] 2025-05-07T20:33:02.6436568Z 2025-05-07T20:33:02.6436753Z if contiguous: 2025-05-07T20:33:02.6436992Z x0 = x0.contiguous() 2025-05-07T20:33:02.6437263Z x1 = x1.contiguous() 2025-05-07T20:33:02.6437508Z 2025-05-07T20:33:02.6437713Z if scale_ub is not None: 2025-05-07T20:33:02.6437994Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.6438338Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.6438659Z ) 2025-05-07T20:33:02.6438861Z else: 2025-05-07T20:33:02.6439073Z scale_ub_tensor = None 2025-05-07T20:33:02.6439334Z 2025-05-07T20:33:02.6439576Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.6439902Z op = silu_mul_quant 2025-05-07T20:33:02.6440411Z if compiled: 2025-05-07T20:33:02.6440669Z op = torch.compile(op) 2025-05-07T20:33:02.6440979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.6441257Z 2025-05-07T20:33:02.6441460Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.6441629Z 2025-05-07T20:33:02.6441740Z moe/activation_test.py:117: 2025-05-07T20:33:02.6442039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.6442384Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.6442676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.6443244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.6443818Z return fn(*args, **kwargs) 
2025-05-07T20:33:02.6444486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.6445189Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.6445808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.6446503Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.6447171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.6447715Z kernel = self.compile( 2025-05-07T20:33:02.6448268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.6448937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.6449347Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.6449581Z 2025-05-07T20:33:02.6449794Z self = 2025-05-07T20:33:02.6450887Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.6452264Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ee51040>} 2025-05-07T20:33:02.6453668Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.6454758Z context = 2025-05-07T20:33:02.6455051Z 2025-05-07T20:33:02.6455222Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.6455756Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.6456230Z module_map=module_map) 2025-05-07T20:33:02.6456615Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.6457038Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.6457307Z E ^ 2025-05-07T20:33:02.6457777Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.6458224Z 2025-05-07T20:33:02.6458648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.6459173Z 2025-05-07T20:33:02.6459281Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.6459708Z self=, 2025-05-07T20:33:02.6460117Z T=128, 2025-05-07T20:33:02.6460307Z D=7168, 2025-05-07T20:33:02.6460507Z scale_ub=1200.0, 2025-05-07T20:33:02.6460741Z contiguous=False, 2025-05-07T20:33:02.6460973Z compiled=False, 2025-05-07T20:33:02.6461254Z ) 2025-05-07T20:33:02.7992891Z self = 2025-05-07T20:33:02.7993681Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:02.7994082Z 2025-05-07T20:33:02.7994225Z @given( 2025-05-07T20:33:02.7994566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.7995027Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.7995475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.7995847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.7996197Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.7996500Z ) 2025-05-07T20:33:02.7996857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.7997317Z def test_silu_mul_quant( 2025-05-07T20:33:02.7997570Z self, 2025-05-07T20:33:02.7997775Z T: int, 2025-05-07T20:33:02.7997989Z D: int, 2025-05-07T20:33:02.7998226Z scale_ub: Optional[float], 2025-05-07T20:33:02.7998506Z contiguous: bool, 2025-05-07T20:33:02.7998878Z compiled: bool, 2025-05-07T20:33:02.7999119Z ) -> None: 2025-05-07T20:33:02.7999341Z torch.manual_seed(2025) 2025-05-07T20:33:02.7999594Z 2025-05-07T20:33:02.7999879Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.8000237Z 2025-05-07T20:33:02.8000447Z x_sign = torch.sign(x) 2025-05-07T20:33:02.8000757Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.8001073Z x = x_sign * x_clamp 2025-05-07T20:33:02.8001326Z x0 = x[:, :D] 2025-05-07T20:33:02.8001556Z x1 = x[:, D:] 2025-05-07T20:33:02.8001770Z 2025-05-07T20:33:02.8002016Z if contiguous: 2025-05-07T20:33:02.8002260Z x0 = x0.contiguous() 2025-05-07T20:33:02.8002534Z x1 = x1.contiguous() 2025-05-07T20:33:02.8002785Z 2025-05-07T20:33:02.8002978Z if scale_ub is not None: 2025-05-07T20:33:02.8003269Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.8003636Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.8003959Z ) 2025-05-07T20:33:02.8004158Z else: 2025-05-07T20:33:02.8004387Z scale_ub_tensor = None 2025-05-07T20:33:02.8004646Z 2025-05-07T20:33:02.8004881Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.8005334Z op = silu_mul_quant 2025-05-07T20:33:02.8005602Z if compiled: 2025-05-07T20:33:02.8005853Z op = torch.compile(op) 2025-05-07T20:33:02.8006158Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8006438Z 2025-05-07T20:33:02.8006633Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.8006805Z 2025-05-07T20:33:02.8006909Z moe/activation_test.py:117: 2025-05-07T20:33:02.8007212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8007550Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.8007840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8008608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.8009305Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.8009846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.8010538Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.8011205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.8011744Z kernel = self.compile( 2025-05-07T20:33:02.8012287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.8012950Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.8013363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8013602Z 2025-05-07T20:33:02.8013812Z self = 2025-05-07T20:33:02.8014910Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.8016282Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ee51ca0>} 2025-05-07T20:33:02.8017623Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.8018644Z context = 2025-05-07T20:33:02.8018934Z 2025-05-07T20:33:02.8019157Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.8019704Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.8020175Z module_map=module_map) 2025-05-07T20:33:02.8020549Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.8020920Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.8021270Z E ^ 2025-05-07T20:33:02.8021751Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.8022200Z 2025-05-07T20:33:02.8022623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.8023142Z 2025-05-07T20:33:02.8023252Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.8023675Z self=, 2025-05-07T20:33:02.8024088Z T=128, 2025-05-07T20:33:02.8024281Z D=5120, 2025-05-07T20:33:02.8024480Z scale_ub=None, 2025-05-07T20:33:02.8024703Z contiguous=False, 2025-05-07T20:33:02.8024936Z compiled=False, 2025-05-07T20:33:02.8025151Z ) 2025-05-07T20:33:02.8025476Z self = 2025-05-07T20:33:02.8026056Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:02.8026341Z 2025-05-07T20:33:02.8026425Z @given( 2025-05-07T20:33:02.8026666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.8026982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.8027297Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.8027633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.8027972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.8028261Z ) 2025-05-07T20:33:02.8028620Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.8029115Z def test_silu_mul_quant( 2025-05-07T20:33:02.8029362Z self, 2025-05-07T20:33:02.8029572Z T: int, 2025-05-07T20:33:02.8029783Z D: int, 2025-05-07T20:33:02.8030004Z scale_ub: Optional[float], 2025-05-07T20:33:02.8030282Z contiguous: bool, 2025-05-07T20:33:02.8030535Z compiled: bool, 2025-05-07T20:33:02.8030761Z ) -> None: 2025-05-07T20:33:02.8030981Z torch.manual_seed(2025) 2025-05-07T20:33:02.8037790Z 2025-05-07T20:33:02.8038101Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.8038457Z 2025-05-07T20:33:02.8038661Z x_sign = torch.sign(x) 2025-05-07T20:33:02.8038961Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.8039285Z x = x_sign * x_clamp 2025-05-07T20:33:02.8039539Z x0 = x[:, :D] 2025-05-07T20:33:02.8039763Z x1 = x[:, D:] 2025-05-07T20:33:02.8039970Z 2025-05-07T20:33:02.8040424Z if contiguous: 2025-05-07T20:33:02.8040673Z x0 = x0.contiguous() 2025-05-07T20:33:02.8040935Z x1 = x1.contiguous() 2025-05-07T20:33:02.8041186Z 2025-05-07T20:33:02.8041389Z if scale_ub is not None: 2025-05-07T20:33:02.8041667Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.8042020Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.8042338Z ) 2025-05-07T20:33:02.8042532Z else: 2025-05-07T20:33:02.8042752Z scale_ub_tensor = None 2025-05-07T20:33:02.8043019Z 2025-05-07T20:33:02.8043256Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.8043585Z op = silu_mul_quant 2025-05-07T20:33:02.8043846Z if compiled: 2025-05-07T20:33:02.8044103Z op = torch.compile(op) 2025-05-07T20:33:02.8044407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8044696Z 2025-05-07T20:33:02.8045011Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.8045186Z 2025-05-07T20:33:02.8045290Z moe/activation_test.py:117: 2025-05-07T20:33:02.8045596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8045940Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.8046224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8046928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.8047622Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.8048167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.8048852Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.8049529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.8050067Z kernel = self.compile( 2025-05-07T20:33:02.8050621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.8051288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.8051688Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8052041Z 2025-05-07T20:33:02.8052260Z self = 2025-05-07T20:33:02.8053344Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.8054714Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158f3fd310>} 2025-05-07T20:33:02.8056058Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.8057149Z context = 2025-05-07T20:33:02.8057440Z 2025-05-07T20:33:02.8057624Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.8058152Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.8058623Z module_map=module_map) 2025-05-07T20:33:02.8059002Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.8059359Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.8059630Z E ^ 2025-05-07T20:33:02.8060098Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.8060552Z 2025-05-07T20:33:02.8060979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.8061570Z 2025-05-07T20:33:02.8061679Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.8062096Z self=, 2025-05-07T20:33:02.8062508Z T=128, 2025-05-07T20:33:02.8062702Z D=5120, 2025-05-07T20:33:02.8062908Z scale_ub=1200.0, 2025-05-07T20:33:02.8063139Z contiguous=True, 2025-05-07T20:33:02.8063365Z compiled=False, 2025-05-07T20:33:02.8063585Z ) 2025-05-07T20:33:03.0361973Z self = 2025-05-07T20:33:03.0363493Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:03.0364273Z 2025-05-07T20:33:03.0364497Z @given( 2025-05-07T20:33:03.0364983Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.0365620Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.0366587Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.0367059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.0367397Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.0367692Z ) 2025-05-07T20:33:03.0368069Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.0368524Z def test_silu_mul_quant( 2025-05-07T20:33:03.0368778Z self, 2025-05-07T20:33:03.0368990Z T: int, 2025-05-07T20:33:03.0369198Z D: int, 2025-05-07T20:33:03.0369435Z scale_ub: Optional[float], 2025-05-07T20:33:03.0369726Z contiguous: bool, 2025-05-07T20:33:03.0369982Z compiled: bool, 2025-05-07T20:33:03.0370213Z ) -> None: 2025-05-07T20:33:03.0370448Z torch.manual_seed(2025) 2025-05-07T20:33:03.0370710Z 2025-05-07T20:33:03.0370993Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.0371346Z 2025-05-07T20:33:03.0371567Z x_sign = torch.sign(x) 2025-05-07T20:33:03.0371870Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.0372198Z x = x_sign * x_clamp 2025-05-07T20:33:03.0372454Z x0 = x[:, :D] 2025-05-07T20:33:03.0372678Z x1 = x[:, D:] 2025-05-07T20:33:03.0372962Z 2025-05-07T20:33:03.0373211Z if contiguous: 2025-05-07T20:33:03.0373454Z x0 = x0.contiguous() 2025-05-07T20:33:03.0373729Z x1 = x1.contiguous() 2025-05-07T20:33:03.0373980Z 2025-05-07T20:33:03.0374180Z if scale_ub is not None: 2025-05-07T20:33:03.0374471Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.0374832Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.0375161Z ) 2025-05-07T20:33:03.0375360Z else: 2025-05-07T20:33:03.0375587Z scale_ub_tensor = None 2025-05-07T20:33:03.0375851Z 2025-05-07T20:33:03.0376091Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.0376495Z op = silu_mul_quant 2025-05-07T20:33:03.0376768Z if compiled: 2025-05-07T20:33:03.0377024Z op = torch.compile(op) 2025-05-07T20:33:03.0377339Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0377633Z 2025-05-07T20:33:03.0377833Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.0378006Z 2025-05-07T20:33:03.0378112Z moe/activation_test.py:117: 2025-05-07T20:33:03.0378417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0378757Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.0379055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0379757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.0380447Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:03.0381059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:03.0381751Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:03.0382418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:03.0382961Z     kernel = self.compile(
2025-05-07T20:33:03.0383519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:03.0384184Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:03.0384585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:03.0390379Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:03.0390914Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:03.0391377Z                            module_map=module_map)
2025-05-07T20:33:03.0391762Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:03.0392127Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:03.0392399Z E   ^
2025-05-07T20:33:03.0392858Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.0393804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError in _fbgemm_silu_mul_quant (the compiled path adds a torch/_dynamo/eval_frame.py:678 frame before entering silu_mul_quant).
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError.
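Every failing example so far dies in the same place: Triton rejects the fp8e4nv element type (torch.float8_e4m3fn) while lowering the kernel, which is expected on pre-Ada NVIDIA parts, since fp8e4nv requires compute capability 8.9 or newer and an A10G-class GPU reports 8.6. Below is a minimal sketch, not code from this repository, of a capability gate a test or caller could apply before touching these kernels; the helper name supports_fp8e4nv is hypothetical.

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: fp8e4nv (torch.float8_e4m3fn) needs sm_89+;
        # sm_86 parts such as the A10G hit the ValueError seen in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # e.g. at the top of test_silu_mul_quant:
    #   if not supports_fp8e4nv():
    #       raise unittest.SkipTest("fp8e4nv unsupported on this GPU (needs sm_89+)")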
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
In this example the call under test returns, and the failure surfaces in the reference path instead (test body as above, continuing):

2025-05-07T20:33:03.5033575Z         y_fp8, y_scale = fn()
2025-05-07T20:33:03.5033867Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:03.5034168Z
2025-05-07T20:33:03.5034411Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:03.5034750Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:03.5035124Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:03.5035448Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:03.5035816Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:03.5036129Z
2025-05-07T20:33:03.5036337Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:03.5036655Z moe/activation_test.py:126:
2025-05-07T20:33:03.5036954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:03.5037298Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:03.5037635Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:03.5038425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:03.5039176Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:03.5046187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:03.5046939Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:03.5047640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:03.5048369Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:03.5049125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:33:03.5049872Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:03.5050603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:03.5051240Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:03.5051958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:03.5052499Z     fn()
2025-05-07T20:33:03.5053020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:03.5053603Z     self.fn.run(
2025-05-07T20:33:03.5054081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:03.5054608Z     kernel = self.compile(
2025-05-07T20:33:03.5055157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:03.5055810Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:03.5063662Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:03.5064029Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:03.5064302Z E   ^
2025-05-07T20:33:03.5064770Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5065656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
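The reference path fails identically because _kernel_quantize_fp8_row also produces fp8e4nv output, so both the fused kernel and the reference quantizer hit the same architecture check. A standalone repro sketch (an assumption about the failing cast, not code taken from this log) that raises the same CompilationError on a pre-sm_89 GPU:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr):
        # The .to(tl.float8e4nv) cast is what Triton rejects at compile
        # time on architectures without fp8e4nv support.
        x = tl.load(x_ptr)
        tl.store(y_ptr, x.to(tl.float8e4nv))

    x = torch.ones(1, device="cuda", dtype=torch.float32)
    y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y)  # CompilationError: "type fp8e4nv not supported ..."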
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError in _fbgemm_silu_mul_quant.
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError.
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError.
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError.
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False): same CompilationError.
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError.
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True): same CompilationError.

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:04.3996682Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:04.3996954Z moe/activation_test.py:117:
2025-05-07T20:33:04.3997258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:04.3997595Z moe/activation_test.py:115: in fn
2025-05-07T20:33:04.3997882Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:04.3998447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:04.3999013Z     return fn(*args, **kwargs)
2025-05-07T20:33:04.3999669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.4000357Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.4000897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.4001580Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.4002292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.4002844Z kernel = self.compile( 2025-05-07T20:33:04.4003389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.4004056Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.4004458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.4004697Z 2025-05-07T20:33:04.4004910Z self = 2025-05-07T20:33:04.4005989Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.4007361Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158f2b9940>} 2025-05-07T20:33:04.4008723Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.4009799Z context = 2025-05-07T20:33:04.4010133Z 2025-05-07T20:33:04.4010305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.4010833Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.4011303Z module_map=module_map) 2025-05-07T20:33:04.4011676Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.4012039Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.4012298Z E ^ 2025-05-07T20:33:04.4012770Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.4013271Z 2025-05-07T20:33:04.4013689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.4014203Z 2025-05-07T20:33:04.5958879Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.5959319Z self=, 2025-05-07T20:33:04.5959903Z T=4096, 2025-05-07T20:33:04.5960107Z D=5120, 2025-05-07T20:33:04.5961106Z scale_ub=1200.0, 2025-05-07T20:33:04.5961584Z contiguous=False, 2025-05-07T20:33:04.5961958Z compiled=False, 2025-05-07T20:33:04.5962296Z ) 2025-05-07T20:33:04.5962812Z self = 2025-05-07T20:33:04.5963569Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:04.5964053Z 2025-05-07T20:33:04.5964180Z @given( 2025-05-07T20:33:04.5964589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.5965139Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.5965654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.5966225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.5966782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.5967279Z ) 2025-05-07T20:33:04.5967886Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.5968663Z def test_silu_mul_quant( 2025-05-07T20:33:04.5969060Z self, 2025-05-07T20:33:04.5969379Z T: int, 2025-05-07T20:33:04.5969705Z D: int, 2025-05-07T20:33:04.5970063Z scale_ub: Optional[float], 2025-05-07T20:33:04.5970520Z contiguous: bool, 2025-05-07T20:33:04.5970923Z compiled: bool, 2025-05-07T20:33:04.5971295Z ) -> None: 2025-05-07T20:33:04.5971656Z torch.manual_seed(2025) 2025-05-07T20:33:04.5972063Z 2025-05-07T20:33:04.5972898Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.5973482Z 2025-05-07T20:33:04.5973802Z x_sign = torch.sign(x) 2025-05-07T20:33:04.5974287Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.5974806Z x = x_sign * x_clamp 2025-05-07T20:33:04.5975222Z x0 = x[:, :D] 2025-05-07T20:33:04.5975592Z x1 = x[:, D:] 2025-05-07T20:33:04.5975930Z 2025-05-07T20:33:04.5976235Z if contiguous: 2025-05-07T20:33:04.5976626Z x0 = x0.contiguous() 2025-05-07T20:33:04.5977052Z x1 = x1.contiguous() 2025-05-07T20:33:04.5977463Z 2025-05-07T20:33:04.5977779Z if scale_ub is not None: 2025-05-07T20:33:04.5978231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.5978820Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.5979363Z ) 2025-05-07T20:33:04.5979672Z else: 2025-05-07T20:33:04.5980022Z scale_ub_tensor = None 2025-05-07T20:33:04.5980460Z 2025-05-07T20:33:04.5980832Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.5981506Z op = silu_mul_quant 2025-05-07T20:33:04.5981928Z if compiled: 2025-05-07T20:33:04.5982339Z op = torch.compile(op) 2025-05-07T20:33:04.5994338Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.5994853Z 2025-05-07T20:33:04.5995183Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.5995469Z 2025-05-07T20:33:04.5995648Z moe/activation_test.py:117: 2025-05-07T20:33:04.5996155Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.5996771Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.5997244Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.5998465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.5999707Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.6000778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.6001993Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.6003169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.6004117Z kernel = self.compile( 2025-05-07T20:33:04.6005059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.6006205Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.6006899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.6007305Z 2025-05-07T20:33:04.6007656Z self = 2025-05-07T20:33:04.6009614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.6012219Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec343a0>} 2025-05-07T20:33:04.6014654Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.6016493Z context = 2025-05-07T20:33:04.6017000Z 2025-05-07T20:33:04.6017292Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.6018201Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.6019098Z module_map=module_map) 2025-05-07T20:33:04.6019728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.6020325Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.6020773Z E ^ 2025-05-07T20:33:04.6021685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.6022502Z 2025-05-07T20:33:04.6023251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.6024163Z 2025-05-07T20:33:04.6024334Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6025049Z self=, 2025-05-07T20:33:04.6025748Z T=4096, 2025-05-07T20:33:04.6026053Z D=5120, 2025-05-07T20:33:04.6026383Z scale_ub=1200.0, 2025-05-07T20:33:04.6026761Z contiguous=False, 2025-05-07T20:33:04.6027129Z compiled=True, 2025-05-07T20:33:04.6027480Z ) 2025-05-07T20:33:04.6028008Z self = 2025-05-07T20:33:04.6028759Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:04.6029125Z 2025-05-07T20:33:04.6029229Z @given( 2025-05-07T20:33:04.6029626Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.6030151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.6030577Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.6031042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.6031494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.6031895Z ) 2025-05-07T20:33:04.6032396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.6033026Z def test_silu_mul_quant( 2025-05-07T20:33:04.6033382Z self, 2025-05-07T20:33:04.6033667Z T: int, 2025-05-07T20:33:04.6033979Z D: int, 2025-05-07T20:33:04.6034424Z scale_ub: Optional[float], 2025-05-07T20:33:04.6034825Z contiguous: bool, 2025-05-07T20:33:04.6035184Z compiled: bool, 2025-05-07T20:33:04.6035524Z ) -> None: 2025-05-07T20:33:04.6035843Z torch.manual_seed(2025) 2025-05-07T20:33:04.6036198Z 2025-05-07T20:33:04.6036649Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.6037135Z 2025-05-07T20:33:04.6037476Z x_sign = torch.sign(x) 2025-05-07T20:33:04.6037971Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.6038460Z x = x_sign * x_clamp 2025-05-07T20:33:04.6038849Z x0 = x[:, :D] 2025-05-07T20:33:04.6039194Z x1 = x[:, D:] 2025-05-07T20:33:04.6039517Z 2025-05-07T20:33:04.6039819Z if contiguous: 2025-05-07T20:33:04.6040855Z x0 = x0.contiguous() 2025-05-07T20:33:04.6041274Z x1 = x1.contiguous() 2025-05-07T20:33:04.6041673Z 2025-05-07T20:33:04.6041996Z if scale_ub is not None: 2025-05-07T20:33:04.6042403Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.6042922Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.6043425Z ) 2025-05-07T20:33:04.6043746Z else: 2025-05-07T20:33:04.6044083Z scale_ub_tensor = None 2025-05-07T20:33:04.6044479Z 2025-05-07T20:33:04.6044850Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.6045366Z op = silu_mul_quant 2025-05-07T20:33:04.6045763Z if compiled: 2025-05-07T20:33:04.6046170Z op = torch.compile(op) 2025-05-07T20:33:04.6046651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.6047116Z 2025-05-07T20:33:04.6047455Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.6047768Z 2025-05-07T20:33:04.6047927Z moe/activation_test.py:117: 2025-05-07T20:33:04.6048404Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.6049132Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.6049625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.6050578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.6051502Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.6052641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.6053828Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.6054753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.6055950Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.6057107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.6058076Z kernel = self.compile( 2025-05-07T20:33:04.6059041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.6060190Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.6060857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.6061480Z 2025-05-07T20:33:04.6061911Z self = 2025-05-07T20:33:04.6063788Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.6066257Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec34280>} 2025-05-07T20:33:04.6068574Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.6070457Z context = 2025-05-07T20:33:04.6070974Z 2025-05-07T20:33:04.6071258Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.6072175Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.6072994Z module_map=module_map) 2025-05-07T20:33:04.6073606Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.6074172Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.6074609Z E ^ 2025-05-07T20:33:04.6075397Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.6076204Z 2025-05-07T20:33:04.6076854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.6077718Z 2025-05-07T20:33:04.8806604Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8807285Z self=, 2025-05-07T20:33:04.8807790Z T=2048, 2025-05-07T20:33:04.8808034Z D=7168, 2025-05-07T20:33:04.8808245Z scale_ub=1200.0, 2025-05-07T20:33:04.8808483Z contiguous=False, 2025-05-07T20:33:04.8808732Z compiled=False, 2025-05-07T20:33:04.8808962Z ) 2025-05-07T20:33:04.8809292Z self = 2025-05-07T20:33:04.8809845Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:04.8810139Z 2025-05-07T20:33:04.8810224Z @given( 2025-05-07T20:33:04.8810476Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8810812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8811416Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8811777Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8812125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8812423Z ) 2025-05-07T20:33:04.8812791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8813266Z def test_silu_mul_quant( 2025-05-07T20:33:04.8813523Z self, 2025-05-07T20:33:04.8813739Z T: int, 2025-05-07T20:33:04.8813959Z D: int, 2025-05-07T20:33:04.8814190Z scale_ub: Optional[float], 2025-05-07T20:33:04.8814482Z contiguous: bool, 2025-05-07T20:33:04.8814739Z compiled: bool, 2025-05-07T20:33:04.8814988Z ) -> None: 2025-05-07T20:33:04.8815216Z torch.manual_seed(2025) 2025-05-07T20:33:04.8815479Z 2025-05-07T20:33:04.8815770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8816124Z 2025-05-07T20:33:04.8816347Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8816656Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8816984Z x = x_sign * x_clamp 2025-05-07T20:33:04.8817245Z x0 = x[:, :D] 2025-05-07T20:33:04.8817500Z x1 = x[:, D:] 2025-05-07T20:33:04.8817748Z 2025-05-07T20:33:04.8818057Z if contiguous: 2025-05-07T20:33:04.8818388Z x0 = x0.contiguous() 2025-05-07T20:33:04.8818661Z x1 = x1.contiguous() 2025-05-07T20:33:04.8818923Z 2025-05-07T20:33:04.8819135Z if scale_ub is not None: 2025-05-07T20:33:04.8819423Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8819777Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8820106Z ) 2025-05-07T20:33:04.8820319Z else: 2025-05-07T20:33:04.8820540Z scale_ub_tensor = None 2025-05-07T20:33:04.8820814Z 2025-05-07T20:33:04.8821174Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8821603Z op = silu_mul_quant 2025-05-07T20:33:04.8821879Z if compiled: 2025-05-07T20:33:04.8822142Z op = torch.compile(op) 2025-05-07T20:33:04.8822448Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8822739Z 2025-05-07T20:33:04.8822948Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.8823124Z 2025-05-07T20:33:04.8823232Z moe/activation_test.py:117: 2025-05-07T20:33:04.8823542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8823893Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.8824183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8824897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.8825606Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.8826161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.8826855Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8827540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8828091Z kernel = self.compile( 2025-05-07T20:33:04.8828654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8829317Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8829734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8829974Z 2025-05-07T20:33:04.8830213Z self = 2025-05-07T20:33:04.8831355Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8832841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ed85670>} 2025-05-07T20:33:04.8834194Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8835239Z context = 2025-05-07T20:33:04.8835534Z 2025-05-07T20:33:04.8835708Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8836246Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8836728Z module_map=module_map) 2025-05-07T20:33:04.8837101Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8837475Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.8837790Z E ^ 2025-05-07T20:33:04.8838274Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8838726Z 2025-05-07T20:33:04.8839201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8839765Z 2025-05-07T20:33:04.8839874Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8840578Z self=, 2025-05-07T20:33:04.8840996Z T=1, 2025-05-07T20:33:04.8841195Z D=7168, 2025-05-07T20:33:04.8841402Z scale_ub=None, 2025-05-07T20:33:04.8841633Z contiguous=True, 2025-05-07T20:33:04.8841863Z compiled=False, 2025-05-07T20:33:04.8842081Z ) 2025-05-07T20:33:04.8842406Z self = 2025-05-07T20:33:04.8842992Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.8843262Z 2025-05-07T20:33:04.8843344Z @given( 2025-05-07T20:33:04.8843586Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8843905Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8844228Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8844574Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8844907Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8845206Z ) 2025-05-07T20:33:04.8845570Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8846027Z def test_silu_mul_quant( 2025-05-07T20:33:04.8846275Z self, 2025-05-07T20:33:04.8846482Z T: int, 2025-05-07T20:33:04.8846695Z D: int, 2025-05-07T20:33:04.8846921Z scale_ub: Optional[float], 2025-05-07T20:33:04.8847204Z contiguous: bool, 2025-05-07T20:33:04.8847463Z compiled: bool, 2025-05-07T20:33:04.8847694Z ) -> None: 2025-05-07T20:33:04.8847928Z torch.manual_seed(2025) 2025-05-07T20:33:04.8848188Z 2025-05-07T20:33:04.8848463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8848821Z 2025-05-07T20:33:04.8849034Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8849338Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8849661Z x = x_sign * x_clamp 2025-05-07T20:33:04.8849918Z x0 = x[:, :D] 2025-05-07T20:33:04.8850142Z x1 = x[:, D:] 2025-05-07T20:33:04.8850365Z 2025-05-07T20:33:04.8850566Z if contiguous: 2025-05-07T20:33:04.8850812Z x0 = x0.contiguous() 2025-05-07T20:33:04.8851080Z x1 = x1.contiguous() 2025-05-07T20:33:04.8851336Z 2025-05-07T20:33:04.8851543Z if scale_ub is not None: 2025-05-07T20:33:04.8851825Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8852251Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8852582Z ) 2025-05-07T20:33:04.8852780Z else: 2025-05-07T20:33:04.8853006Z scale_ub_tensor = None 2025-05-07T20:33:04.8853276Z 2025-05-07T20:33:04.8853513Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8853850Z op = silu_mul_quant 2025-05-07T20:33:04.8854120Z if compiled: 2025-05-07T20:33:04.8854376Z op = torch.compile(op) 2025-05-07T20:33:04.8854686Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8854977Z 2025-05-07T20:33:04.8855176Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.8855356Z 2025-05-07T20:33:04.8855460Z moe/activation_test.py:117: 2025-05-07T20:33:04.8855768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8856114Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.8856402Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8857107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.8857853Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.8858395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.8859248Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8859933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8860477Z kernel = self.compile( 2025-05-07T20:33:04.8861097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8861764Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8862175Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8862464Z 2025-05-07T20:33:04.8862693Z self = 2025-05-07T20:33:04.8863778Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8865153Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec5f280>} 2025-05-07T20:33:04.8866501Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8867550Z context = 2025-05-07T20:33:04.8867868Z 2025-05-07T20:33:04.8868042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8868584Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8869063Z module_map=module_map) 2025-05-07T20:33:04.8869444Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8869813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.8870089Z E ^ 2025-05-07T20:33:04.8870570Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8871027Z 2025-05-07T20:33:04.8871444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8871979Z 2025-05-07T20:33:04.8872087Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8872518Z self=, 2025-05-07T20:33:04.8872936Z T=16384, 2025-05-07T20:33:04.8873187Z D=7168, 2025-05-07T20:33:04.8873399Z scale_ub=1200.0, 2025-05-07T20:33:04.8873636Z contiguous=False, 2025-05-07T20:33:04.8873869Z compiled=True, 2025-05-07T20:33:04.8874089Z ) 2025-05-07T20:33:05.0789293Z self = 2025-05-07T20:33:05.0790080Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.0790371Z 2025-05-07T20:33:05.0790466Z @given( 2025-05-07T20:33:05.0790708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.0791042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.0791369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.0791722Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.0792066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.0792370Z ) 2025-05-07T20:33:05.0792734Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.0793198Z def test_silu_mul_quant( 2025-05-07T20:33:05.0793456Z self, 2025-05-07T20:33:05.0793666Z T: int, 2025-05-07T20:33:05.0793870Z D: int, 2025-05-07T20:33:05.0794106Z scale_ub: Optional[float], 2025-05-07T20:33:05.0794395Z contiguous: bool, 2025-05-07T20:33:05.0794976Z compiled: bool, 2025-05-07T20:33:05.0795221Z ) -> None: 2025-05-07T20:33:05.0795454Z torch.manual_seed(2025) 2025-05-07T20:33:05.0795702Z 2025-05-07T20:33:05.0795989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.0796347Z 2025-05-07T20:33:05.0796550Z x_sign = torch.sign(x) 2025-05-07T20:33:05.0796862Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.0797192Z x = x_sign * x_clamp 2025-05-07T20:33:05.0797447Z x0 = x[:, :D] 2025-05-07T20:33:05.0797705Z x1 = x[:, D:] 2025-05-07T20:33:05.0797946Z 2025-05-07T20:33:05.0798149Z if contiguous: 2025-05-07T20:33:05.0798480Z x0 = x0.contiguous() 2025-05-07T20:33:05.0798762Z x1 = x1.contiguous() 2025-05-07T20:33:05.0799021Z 2025-05-07T20:33:05.0799223Z if scale_ub is not None: 2025-05-07T20:33:05.0799521Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.0799882Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.0800203Z ) 2025-05-07T20:33:05.0800418Z else: 2025-05-07T20:33:05.0800649Z scale_ub_tensor = None 2025-05-07T20:33:05.0800912Z 2025-05-07T20:33:05.0801162Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.0801490Z op = silu_mul_quant 2025-05-07T20:33:05.0801752Z if compiled: 2025-05-07T20:33:05.0802014Z op = torch.compile(op) 2025-05-07T20:33:05.0802325Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0802608Z 2025-05-07T20:33:05.0802806Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.0802993Z 2025-05-07T20:33:05.0803099Z moe/activation_test.py:117: 2025-05-07T20:33:05.0803406Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0803745Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.0804043Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0804615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.0805176Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.0805852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.0806543Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.0807092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.0807772Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.0808529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.0809079Z kernel = self.compile( 2025-05-07T20:33:05.0809627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.0810296Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.0810700Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0810935Z 2025-05-07T20:33:05.0811154Z self = 2025-05-07T20:33:05.0812227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.0813609Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec5fee0>} 2025-05-07T20:33:05.0814962Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.0816066Z context = 2025-05-07T20:33:05.0816359Z 2025-05-07T20:33:05.0816537Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.0817062Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.0817532Z module_map=module_map) 2025-05-07T20:33:05.0817956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.0818313Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.0818581Z E ^ 2025-05-07T20:33:05.0819052Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.0819550Z 2025-05-07T20:33:05.0819982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.0820495Z 2025-05-07T20:33:05.0820608Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.0821112Z self=, 2025-05-07T20:33:05.0821523Z T=1, 2025-05-07T20:33:05.0821714Z D=7168, 2025-05-07T20:33:05.0821909Z scale_ub=None, 2025-05-07T20:33:05.0822139Z contiguous=False, 2025-05-07T20:33:05.0822378Z compiled=False, 2025-05-07T20:33:05.0822588Z ) 2025-05-07T20:33:05.0822913Z self = 2025-05-07T20:33:05.0823407Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:05.0823670Z 2025-05-07T20:33:05.0823750Z @given( 2025-05-07T20:33:05.0823993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.0832625Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.0833004Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.0833363Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.0833715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.0834019Z ) 2025-05-07T20:33:05.0834382Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.0834838Z def test_silu_mul_quant( 2025-05-07T20:33:05.0835098Z self, 2025-05-07T20:33:05.0835309Z T: int, 2025-05-07T20:33:05.0835514Z D: int, 2025-05-07T20:33:05.0835747Z scale_ub: Optional[float], 2025-05-07T20:33:05.0836034Z contiguous: bool, 2025-05-07T20:33:05.0836284Z compiled: bool, 2025-05-07T20:33:05.0836527Z ) -> None: 2025-05-07T20:33:05.0836755Z torch.manual_seed(2025) 2025-05-07T20:33:05.0837007Z 2025-05-07T20:33:05.0837379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.0837790Z 2025-05-07T20:33:05.0837995Z x_sign = torch.sign(x) 2025-05-07T20:33:05.0838291Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.0838618Z x = x_sign * x_clamp 2025-05-07T20:33:05.0838888Z x0 = x[:, :D] 2025-05-07T20:33:05.0839109Z x1 = x[:, D:] 2025-05-07T20:33:05.0839327Z 2025-05-07T20:33:05.0839526Z if contiguous: 2025-05-07T20:33:05.0839765Z x0 = x0.contiguous() 2025-05-07T20:33:05.0840038Z x1 = x1.contiguous() 2025-05-07T20:33:05.0840581Z 2025-05-07T20:33:05.0840780Z if scale_ub is not None: 2025-05-07T20:33:05.0841072Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.0841424Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.0841743Z ) 2025-05-07T20:33:05.0841958Z else: 2025-05-07T20:33:05.0842189Z scale_ub_tensor = None 2025-05-07T20:33:05.0842447Z 2025-05-07T20:33:05.0842694Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.0843026Z op = silu_mul_quant 2025-05-07T20:33:05.0843296Z if compiled: 2025-05-07T20:33:05.0843550Z op = torch.compile(op) 2025-05-07T20:33:05.0844039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0844334Z 2025-05-07T20:33:05.0844534Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.0844714Z 2025-05-07T20:33:05.0844820Z moe/activation_test.py:117: 2025-05-07T20:33:05.0845128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0845468Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.0845764Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0846478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.0847248Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.0847796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.0848493Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.0849173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.0849716Z kernel = self.compile( 2025-05-07T20:33:05.0850270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.0850937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.0851344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0851578Z 2025-05-07T20:33:05.0851788Z self = 2025-05-07T20:33:05.0852876Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.0854271Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ecb6670>} 2025-05-07T20:33:05.0855629Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.0856664Z context = 2025-05-07T20:33:05.0856955Z 2025-05-07T20:33:05.0857127Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.0857668Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.0858209Z module_map=module_map) 2025-05-07T20:33:05.0858582Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.0858946Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.0859215Z E ^ 2025-05-07T20:33:05.0859689Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.0860148Z 2025-05-07T20:33:05.0860574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.0861182Z 2025-05-07T20:33:05.0861290Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.0861715Z self=, 2025-05-07T20:33:05.0862119Z T=2048, 2025-05-07T20:33:05.0862315Z D=7168, 2025-05-07T20:33:05.0862516Z scale_ub=None, 2025-05-07T20:33:05.0862733Z contiguous=False, 2025-05-07T20:33:05.0862980Z compiled=True, 2025-05-07T20:33:05.0863194Z ) 2025-05-07T20:33:05.3808694Z self = 2025-05-07T20:33:05.3809230Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.3809552Z 2025-05-07T20:33:05.3809642Z @given( 2025-05-07T20:33:05.3810172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3810653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3811099Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3811577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3812038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3812431Z ) 2025-05-07T20:33:05.3812905Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3813377Z def test_silu_mul_quant( 2025-05-07T20:33:05.3813624Z self, 2025-05-07T20:33:05.3813830Z T: int, 2025-05-07T20:33:05.3814174Z D: int, 2025-05-07T20:33:05.3814398Z scale_ub: Optional[float], 2025-05-07T20:33:05.3814682Z contiguous: bool, 2025-05-07T20:33:05.3814939Z compiled: bool, 2025-05-07T20:33:05.3815172Z ) -> None: 2025-05-07T20:33:05.3815400Z torch.manual_seed(2025) 2025-05-07T20:33:05.3815661Z 2025-05-07T20:33:05.3815942Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3816295Z 2025-05-07T20:33:05.3816504Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3816807Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3817121Z x = x_sign * x_clamp 2025-05-07T20:33:05.3817373Z x0 = x[:, :D] 2025-05-07T20:33:05.3817604Z x1 = x[:, D:] 2025-05-07T20:33:05.3817855Z 2025-05-07T20:33:05.3818056Z if contiguous: 2025-05-07T20:33:05.3818299Z x0 = x0.contiguous() 2025-05-07T20:33:05.3818569Z x1 = x1.contiguous() 2025-05-07T20:33:05.3818817Z 2025-05-07T20:33:05.3819019Z if scale_ub is not None: 2025-05-07T20:33:05.3819307Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3819651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3819969Z ) 2025-05-07T20:33:05.3820171Z else: 2025-05-07T20:33:05.3820389Z scale_ub_tensor = None 2025-05-07T20:33:05.3820654Z 2025-05-07T20:33:05.3820896Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3821302Z op = silu_mul_quant 2025-05-07T20:33:05.3821567Z if compiled: 2025-05-07T20:33:05.3821827Z op = torch.compile(op) 2025-05-07T20:33:05.3822125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3822408Z 2025-05-07T20:33:05.3822607Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3822777Z 2025-05-07T20:33:05.3822888Z moe/activation_test.py:117: 2025-05-07T20:33:05.3823264Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3823614Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3823936Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3824500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3825065Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3825731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3826420Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3826957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3827639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3828305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3828849Z kernel = self.compile( 2025-05-07T20:33:05.3829395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3830053Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3830505Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3830775Z 2025-05-07T20:33:05.3830993Z self = 2025-05-07T20:33:05.3832081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3833455Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e8c1550>} 2025-05-07T20:33:05.3834800Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3835874Z context = 2025-05-07T20:33:05.3836170Z 2025-05-07T20:33:05.3836354Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3836884Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3837358Z module_map=module_map) 2025-05-07T20:33:05.3837735Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3838092Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3838362Z E ^ 2025-05-07T20:33:05.3838836Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3839284Z 2025-05-07T20:33:05.3839717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3840418Z 2025-05-07T20:33:05.3840529Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3840958Z self=, 2025-05-07T20:33:05.3841376Z T=4096, 2025-05-07T20:33:05.3841567Z D=7168, 2025-05-07T20:33:05.3841769Z scale_ub=None, 2025-05-07T20:33:05.3841994Z contiguous=False, 2025-05-07T20:33:05.3842226Z compiled=True, 2025-05-07T20:33:05.3842440Z ) 2025-05-07T20:33:05.3842767Z self = 2025-05-07T20:33:05.3843267Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.3843550Z 2025-05-07T20:33:05.3843632Z @given( 2025-05-07T20:33:05.3843876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3844202Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3844592Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3844933Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3845277Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3845566Z ) 2025-05-07T20:33:05.3845924Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3846381Z def test_silu_mul_quant( 2025-05-07T20:33:05.3846625Z self, 2025-05-07T20:33:05.3846826Z T: int, 2025-05-07T20:33:05.3847031Z D: int, 2025-05-07T20:33:05.3847250Z scale_ub: Optional[float], 2025-05-07T20:33:05.3847533Z contiguous: bool, 2025-05-07T20:33:05.3847783Z compiled: bool, 2025-05-07T20:33:05.3848007Z ) -> None: 2025-05-07T20:33:05.3848238Z torch.manual_seed(2025) 2025-05-07T20:33:05.3848491Z 2025-05-07T20:33:05.3848765Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3849121Z 2025-05-07T20:33:05.3849330Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3849622Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3849943Z x = x_sign * x_clamp 2025-05-07T20:33:05.3850194Z x0 = x[:, :D] 2025-05-07T20:33:05.3850418Z x1 = x[:, D:] 2025-05-07T20:33:05.3850703Z 2025-05-07T20:33:05.3850960Z if contiguous: 2025-05-07T20:33:05.3851208Z x0 = x0.contiguous() 2025-05-07T20:33:05.3851470Z x1 = x1.contiguous() 2025-05-07T20:33:05.3851724Z 2025-05-07T20:33:05.3851928Z if scale_ub is not None: 2025-05-07T20:33:05.3852206Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3852553Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3852872Z ) 2025-05-07T20:33:05.3853071Z else: 2025-05-07T20:33:05.3853291Z scale_ub_tensor = None 2025-05-07T20:33:05.3853554Z 2025-05-07T20:33:05.3853792Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3854180Z op = silu_mul_quant 2025-05-07T20:33:05.3854442Z if compiled: 2025-05-07T20:33:05.3854695Z op = torch.compile(op) 2025-05-07T20:33:05.3854999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3855288Z 2025-05-07T20:33:05.3855489Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3855667Z 2025-05-07T20:33:05.3855770Z moe/activation_test.py:117: 2025-05-07T20:33:05.3856077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3856419Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3856703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3857269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3857838Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3858501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3859199Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3859746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3860431Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3861149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3861691Z kernel = self.compile( 2025-05-07T20:33:05.3862238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3862907Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3863307Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3863548Z 2025-05-07T20:33:05.3863759Z self = 2025-05-07T20:33:05.3864891Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3866278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e92b160>} 2025-05-07T20:33:05.3867624Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3868657Z context = 2025-05-07T20:33:05.3868956Z 2025-05-07T20:33:05.3869125Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3869658Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3870128Z module_map=module_map) 2025-05-07T20:33:05.3870504Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3870867Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3871129Z E ^ 2025-05-07T20:33:05.3871678Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3872133Z 2025-05-07T20:33:05.3872556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3873067Z 2025-05-07T20:33:05.5934751Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.5936164Z self=, 2025-05-07T20:33:05.5936970Z T=16384, 2025-05-07T20:33:05.5937296Z D=5120, 2025-05-07T20:33:05.5937598Z scale_ub=1200.0, 2025-05-07T20:33:05.5938329Z contiguous=False, 2025-05-07T20:33:05.5938663Z compiled=False, 2025-05-07T20:33:05.5938978Z ) 2025-05-07T20:33:05.5939510Z self = 2025-05-07T20:33:05.5941328Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.5941832Z 2025-05-07T20:33:05.5941968Z @given( 2025-05-07T20:33:05.5942347Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.5942873Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.5943382Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.5943945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.5944497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.5944979Z ) 2025-05-07T20:33:05.5945570Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.5946329Z def test_silu_mul_quant( 2025-05-07T20:33:05.5946738Z self, 2025-05-07T20:33:05.5947050Z T: int, 2025-05-07T20:33:05.5947375Z D: int, 2025-05-07T20:33:05.5947736Z scale_ub: Optional[float], 2025-05-07T20:33:05.5948188Z contiguous: bool, 2025-05-07T20:33:05.5948590Z compiled: bool, 2025-05-07T20:33:05.5948964Z ) -> None: 2025-05-07T20:33:05.5949316Z torch.manual_seed(2025) 2025-05-07T20:33:05.5949718Z 2025-05-07T20:33:05.5950165Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.5950732Z 2025-05-07T20:33:05.5951045Z x_sign = torch.sign(x) 2025-05-07T20:33:05.5951529Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.5952048Z x = x_sign * x_clamp 2025-05-07T20:33:05.5952439Z x0 = x[:, :D] 2025-05-07T20:33:05.5952789Z x1 = x[:, D:] 2025-05-07T20:33:05.5953129Z 2025-05-07T20:33:05.5953426Z if contiguous: 2025-05-07T20:33:05.5953806Z x0 = x0.contiguous() 2025-05-07T20:33:05.5954398Z x1 = x1.contiguous() 2025-05-07T20:33:05.5954804Z 2025-05-07T20:33:05.5955118Z if scale_ub is not None: 2025-05-07T20:33:05.5955573Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.5956119Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.5956656Z ) 2025-05-07T20:33:05.5956996Z else: 2025-05-07T20:33:05.5957336Z scale_ub_tensor = None 2025-05-07T20:33:05.5957757Z 2025-05-07T20:33:05.5958136Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.5958661Z op = silu_mul_quant 2025-05-07T20:33:05.5959075Z if compiled: 2025-05-07T20:33:05.5959481Z op = torch.compile(op) 2025-05-07T20:33:05.5959970Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.5960434Z 2025-05-07T20:33:05.5960748Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.5961023Z 2025-05-07T20:33:05.5961195Z moe/activation_test.py:117: 2025-05-07T20:33:05.5961691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.5962222Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.5962681Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.5963982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:05.5965269Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.5966182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.5967342Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.5968516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.5969452Z kernel = self.compile( 2025-05-07T20:33:05.5970399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.5971682Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.5972362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.5972766Z 2025-05-07T20:33:05.5973106Z self = 2025-05-07T20:33:05.5975038Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.5977540Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e92b940>} 2025-05-07T20:33:05.5979929Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.5981822Z context = 2025-05-07T20:33:05.5982314Z 2025-05-07T20:33:05.5982596Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.5983507Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.5984315Z module_map=module_map) 2025-05-07T20:33:05.5984929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.5985522Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.5985949Z E ^ 2025-05-07T20:33:05.5986751Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:05.5988302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
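Every failure in this run is the same compile-time error: the _fbgemm_silu_mul_quant kernel asks Triton for the fp8e4nv (FP8 E4M3) dtype, which Triton only emits on NVIDIA GPUs with compute capability 8.9 or newer; on older parts only fp8e4b15 and fp8e5 exist, exactly as the ValueError reports. A minimal sketch of one way to gate such a test on device capability follows; supports_fp8e4nv, the capability threshold, and the class name are illustrative assumptions, not code from activation_test.py:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (E4M3) codegen needs compute capability >= 8.9;
        # earlier GPUs only expose fp8e4b15/fp8e5, matching the ValueError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "GPU lacks fp8e4nv (E4M3) support")
    class SiluMulQuantGuardedTest(unittest.TestCase):
        ...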
[log trimmed: hypothesis went on to try ten more examples, each failing at the same point with the identical CompilationError ("type fp8e4nv not supported in this architecture"):
  T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
  T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True
  T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True
  T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
  T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True
  T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False
  T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False
  T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True]
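For reference, the property being exercised: the test builds a bf16 activation pair (x0, x1), optionally a float32 scale_ub tensor, and expects silu_mul_quant to return an FP8 tensor plus its scale. The fused kernel's exact quantization scheme is not visible in this log; the sketch below is an eager-mode approximation assuming rowwise E4M3 scaling, with silu_mul_quant_ref as an illustrative name:

    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1 in fp32, then rowwise quantization to FP8 E4M3.
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            # Optional upper bound on the rowwise amax, as in the test's scale_ub.
            row_max = torch.minimum(row_max, scale_ub.float())
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale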
2025-05-07T20:33:06.9401698Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:06.9402418Z   self=,
2025-05-07T20:33:06.9403105Z   T=128,
2025-05-07T20:33:06.9403415Z   D=7168,
2025-05-07T20:33:06.9403738Z   scale_ub=1200.0,
2025-05-07T20:33:06.9404105Z   contiguous=False,
2025-05-07T20:33:06.9404480Z   compiled=True,
2025-05-07T20:33:06.9404905Z )
[test body and traceback identical to the first example above]
2025-05-07T20:33:06.9450458Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.9451064Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:06.9451502Z E   ^
2025-05-07T20:33:06.9452297Z E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.9453109Z 2025-05-07T20:33:06.9453848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.9454763Z 2025-05-07T20:33:07.1190712Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1191495Z self=, 2025-05-07T20:33:07.1192162Z T=2048, 2025-05-07T20:33:07.1192463Z D=7168, 2025-05-07T20:33:07.1192783Z scale_ub=None, 2025-05-07T20:33:07.1193138Z contiguous=True, 2025-05-07T20:33:07.1193501Z compiled=True, 2025-05-07T20:33:07.1193845Z ) 2025-05-07T20:33:07.1194367Z self = 2025-05-07T20:33:07.1195552Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.1196048Z 2025-05-07T20:33:07.1196178Z @given( 2025-05-07T20:33:07.1196554Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1197083Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1197595Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1198187Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1198774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1199252Z ) 2025-05-07T20:33:07.1199851Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1200624Z def test_silu_mul_quant( 2025-05-07T20:33:07.1201025Z self, 2025-05-07T20:33:07.1201339Z T: int, 2025-05-07T20:33:07.1201663Z D: int, 2025-05-07T20:33:07.1202025Z scale_ub: Optional[float], 2025-05-07T20:33:07.1202478Z contiguous: bool, 2025-05-07T20:33:07.1202876Z compiled: bool, 2025-05-07T20:33:07.1203265Z ) -> None: 2025-05-07T20:33:07.1203616Z torch.manual_seed(2025) 2025-05-07T20:33:07.1204022Z 2025-05-07T20:33:07.1204478Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1205059Z 2025-05-07T20:33:07.1205375Z x_sign = torch.sign(x) 2025-05-07T20:33:07.1206116Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.1206646Z x = x_sign * x_clamp 2025-05-07T20:33:07.1207056Z x0 = x[:, :D] 2025-05-07T20:33:07.1207410Z x1 = x[:, D:] 2025-05-07T20:33:07.1207748Z 2025-05-07T20:33:07.1208054Z if contiguous: 2025-05-07T20:33:07.1208435Z x0 = x0.contiguous() 2025-05-07T20:33:07.1208860Z x1 = x1.contiguous() 2025-05-07T20:33:07.1209267Z 2025-05-07T20:33:07.1209585Z if scale_ub is not None: 2025-05-07T20:33:07.1210041Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.1210601Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.1211293Z ) 2025-05-07T20:33:07.1211613Z else: 2025-05-07T20:33:07.1211949Z scale_ub_tensor = None 2025-05-07T20:33:07.1212379Z 2025-05-07T20:33:07.1212774Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.1213328Z op = silu_mul_quant 2025-05-07T20:33:07.1213763Z if compiled: 2025-05-07T20:33:07.1214180Z op = torch.compile(op) 2025-05-07T20:33:07.1214669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.1215141Z 2025-05-07T20:33:07.1215453Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.1215737Z 2025-05-07T20:33:07.1215900Z moe/activation_test.py:117: 2025-05-07T20:33:07.1216392Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.1216960Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.1217435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.1218393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.1219377Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.1220491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.1221843Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.1222761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.1223934Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.1225084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.1226035Z kernel = self.compile( 2025-05-07T20:33:07.1226979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.1228218Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.1228948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.1229349Z 2025-05-07T20:33:07.1229694Z self = 2025-05-07T20:33:07.1231578Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.1234063Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e383550>} 2025-05-07T20:33:07.1236448Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.1238238Z context = 2025-05-07T20:33:07.1238750Z 2025-05-07T20:33:07.1239029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.1239932Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.1241264Z module_map=module_map) 2025-05-07T20:33:07.1241888Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.1242478Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.1242920Z E ^ 2025-05-07T20:33:07.1243690Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.1244490Z 2025-05-07T20:33:07.1245227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.1246141Z 2025-05-07T20:33:07.1246323Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1247136Z self=, 2025-05-07T20:33:07.1247828Z T=16384, 2025-05-07T20:33:07.1248140Z D=5120, 2025-05-07T20:33:07.1248453Z scale_ub=None, 2025-05-07T20:33:07.1248793Z contiguous=False, 2025-05-07T20:33:07.1249175Z compiled=False, 2025-05-07T20:33:07.1249511Z ) 2025-05-07T20:33:07.1250033Z self = 2025-05-07T20:33:07.1250886Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.1251362Z 2025-05-07T20:33:07.1251499Z @given( 2025-05-07T20:33:07.1251861Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1252385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1252902Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1253447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1254016Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1254502Z ) 2025-05-07T20:33:07.1255096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1255854Z def test_silu_mul_quant( 2025-05-07T20:33:07.1256255Z self, 2025-05-07T20:33:07.1256570Z T: int, 2025-05-07T20:33:07.1256889Z D: int, 2025-05-07T20:33:07.1257248Z scale_ub: Optional[float], 2025-05-07T20:33:07.1257704Z contiguous: bool, 2025-05-07T20:33:07.1258114Z compiled: bool, 2025-05-07T20:33:07.1258499Z ) -> None: 2025-05-07T20:33:07.1258851Z torch.manual_seed(2025) 2025-05-07T20:33:07.1259241Z 2025-05-07T20:33:07.1259686Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1260265Z 2025-05-07T20:33:07.1260570Z x_sign = torch.sign(x) 2025-05-07T20:33:07.1261139Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.1264583Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.1267968Z 2025-05-07T20:33:07.1268174Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.1268550Z 2025-05-07T20:33:07.1268719Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1269430Z self=, 2025-05-07T20:33:07.1270118Z T=4096, 2025-05-07T20:33:07.1270430Z D=7168, 2025-05-07T20:33:07.1270746Z scale_ub=1200.0, 2025-05-07T20:33:07.1271122Z contiguous=True, 2025-05-07T20:33:07.1271490Z compiled=True, 2025-05-07T20:33:07.1271823Z ) 2025-05-07T20:33:07.1272353Z self = 2025-05-07T20:33:07.1273206Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.1273828Z 2025-05-07T20:33:07.1274013Z @given( 2025-05-07T20:33:07.1274392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1274914Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1275434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1275988Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1276528Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1277013Z ) 2025-05-07T20:33:07.1277590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1278339Z def test_silu_mul_quant( 2025-05-07T20:33:07.1278745Z self, 2025-05-07T20:33:07.1279154Z T: int, 2025-05-07T20:33:07.1279471Z D: int, 2025-05-07T20:33:07.1279824Z scale_ub: Optional[float], 2025-05-07T20:33:07.1280286Z contiguous: bool, 2025-05-07T20:33:07.1280690Z compiled: bool, 2025-05-07T20:33:07.1281048Z ) -> None: 2025-05-07T20:33:07.1281402Z torch.manual_seed(2025) 2025-05-07T20:33:07.1281787Z 2025-05-07T20:33:07.1282203Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1282772Z 2025-05-07T20:33:07.1283085Z x_sign = torch.sign(x) 2025-05-07T20:33:07.1283560Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.1287193Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
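The OutOfMemoryError entries interleaved here look like a knock-on effect of the failures above rather than an independent bug: the allocator reports roughly 21.9-22.0 GiB of the card's 22.07 GiB already in use when requests as small as 40 MiB arrive, and the "allocated by PyTorch" figure creeps upward across examples, which suggests tensors from earlier failing draws are still referenced. Two standard mitigations, as a hedged sketch (the placement between examples, e.g. in tearDown, is illustrative): reclaim cached memory between draws, and/or run with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True as the allocator message itself recommends.

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Call between test examples to return memory held by dead
        # tensors and by the caching allocator.
        gc.collect()                  # drop unreachable Python references
        torch.cuda.empty_cache()      # hand cached blocks back to the driver
        torch.cuda.synchronize()      # make sure pending frees have completed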
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.1290579Z 2025-05-07T20:33:07.1290780Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.1291145Z 2025-05-07T20:33:07.1291325Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1292027Z self=, 2025-05-07T20:33:07.1292711Z T=16384, 2025-05-07T20:33:07.1293031Z D=7168, 2025-05-07T20:33:07.1293339Z scale_ub=None, 2025-05-07T20:33:07.1293686Z contiguous=False, 2025-05-07T20:33:07.1294057Z compiled=False, 2025-05-07T20:33:07.1294393Z ) 2025-05-07T20:33:07.2333867Z self = 2025-05-07T20:33:07.2335117Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.2335618Z 2025-05-07T20:33:07.2335744Z @given( 2025-05-07T20:33:07.2336125Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2336654Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2337163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2337665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2338166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2338613Z ) 2025-05-07T20:33:07.2339158Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2339883Z def test_silu_mul_quant( 2025-05-07T20:33:07.2340631Z self, 2025-05-07T20:33:07.2340963Z T: int, 2025-05-07T20:33:07.2341357Z D: int, 2025-05-07T20:33:07.2341710Z scale_ub: Optional[float], 2025-05-07T20:33:07.2342165Z contiguous: bool, 2025-05-07T20:33:07.2342548Z compiled: bool, 2025-05-07T20:33:07.2342913Z ) -> None: 2025-05-07T20:33:07.2343283Z torch.manual_seed(2025) 2025-05-07T20:33:07.2343696Z 2025-05-07T20:33:07.2344136Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2347941Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
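The request sizes in these reports follow directly from the test's first allocation: x has shape [T, 2*D] in bfloat16, i.e. 2 bytes per element, so T=16384 with D=7168 asks for exactly the 448.00 MiB shown above. A quick check of that arithmetic:

    # bfloat16 tensor of shape [T, 2*D]: bytes = T * (2*D) * 2
    T, D = 16384, 7168
    bytes_needed = T * (2 * D) * 2
    assert bytes_needed == 448 * 2**20   # 448.00 MiB, matching the log
    # Likewise T=16384, D=5120 -> 320 MiB and T=4096, D=7168 -> 112 MiB.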
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2351450Z 2025-05-07T20:33:07.2351656Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2352035Z 2025-05-07T20:33:07.2352205Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2353053Z self=, 2025-05-07T20:33:07.2353746Z T=2048, 2025-05-07T20:33:07.2354057Z D=7168, 2025-05-07T20:33:07.2354374Z scale_ub=1200.0, 2025-05-07T20:33:07.2354737Z contiguous=True, 2025-05-07T20:33:07.2355106Z compiled=True, 2025-05-07T20:33:07.2355463Z ) 2025-05-07T20:33:07.2356000Z self = 2025-05-07T20:33:07.2356837Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.2357311Z 2025-05-07T20:33:07.2357436Z @given( 2025-05-07T20:33:07.2357815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2358328Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2358846Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2359406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2359953Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2360455Z ) 2025-05-07T20:33:07.2361059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2361837Z def test_silu_mul_quant( 2025-05-07T20:33:07.2362237Z self, 2025-05-07T20:33:07.2362562Z T: int, 2025-05-07T20:33:07.2362894Z D: int, 2025-05-07T20:33:07.2363253Z scale_ub: Optional[float], 2025-05-07T20:33:07.2363712Z contiguous: bool, 2025-05-07T20:33:07.2364119Z compiled: bool, 2025-05-07T20:33:07.2364484Z ) -> None: 2025-05-07T20:33:07.2364832Z torch.manual_seed(2025) 2025-05-07T20:33:07.2365245Z 2025-05-07T20:33:07.2365687Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2366270Z 2025-05-07T20:33:07.2366585Z x_sign = torch.sign(x) 2025-05-07T20:33:07.2367051Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.2370577Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2373838Z 2025-05-07T20:33:07.2374040Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.2374409Z 2025-05-07T20:33:07.2374577Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2375274Z self=, 2025-05-07T20:33:07.2375946Z T=2048, 2025-05-07T20:33:07.2376259Z D=7168, 2025-05-07T20:33:07.2376574Z scale_ub=None, 2025-05-07T20:33:07.2376918Z contiguous=True, 2025-05-07T20:33:07.2377302Z compiled=False, 2025-05-07T20:33:07.2377651Z ) 2025-05-07T20:33:07.2378168Z self = 2025-05-07T20:33:07.2378991Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.2379467Z 2025-05-07T20:33:07.2379681Z @given( 2025-05-07T20:33:07.2380146Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2380666Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2381281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2381837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2382382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2382870Z ) 2025-05-07T20:33:07.2383462Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2384214Z def test_silu_mul_quant( 2025-05-07T20:33:07.2384618Z self, 2025-05-07T20:33:07.2385022Z T: int, 2025-05-07T20:33:07.2385340Z D: int, 2025-05-07T20:33:07.2385692Z scale_ub: Optional[float], 2025-05-07T20:33:07.2386147Z contiguous: bool, 2025-05-07T20:33:07.2386542Z compiled: bool, 2025-05-07T20:33:07.2386912Z ) -> None: 2025-05-07T20:33:07.2387261Z torch.manual_seed(2025) 2025-05-07T20:33:07.2387672Z 2025-05-07T20:33:07.2388113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2388700Z 2025-05-07T20:33:07.2389017Z > x_sign = torch.sign(x) 2025-05-07T20:33:07.2392471Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2395848Z 2025-05-07T20:33:07.2396049Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:07.2396428Z 2025-05-07T20:33:07.2396598Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2397309Z self=, 2025-05-07T20:33:07.2398002Z T=1, 2025-05-07T20:33:07.2398301Z D=7168, 2025-05-07T20:33:07.2398678Z scale_ub=1200.0, 2025-05-07T20:33:07.2399047Z contiguous=True, 2025-05-07T20:33:07.2399405Z compiled=False, 2025-05-07T20:33:07.2399747Z ) 2025-05-07T20:33:07.5747114Z self = 2025-05-07T20:33:07.5748031Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.5748485Z 2025-05-07T20:33:07.5748617Z @given( 2025-05-07T20:33:07.5749379Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5749903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5750407Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5750898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5751405Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5751873Z ) 2025-05-07T20:33:07.5752433Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5753138Z def test_silu_mul_quant( 2025-05-07T20:33:07.5753504Z self, 2025-05-07T20:33:07.5753799Z T: int, 2025-05-07T20:33:07.5754099Z D: int, 2025-05-07T20:33:07.5754419Z scale_ub: Optional[float], 2025-05-07T20:33:07.5754853Z contiguous: bool, 2025-05-07T20:33:07.5755248Z compiled: bool, 2025-05-07T20:33:07.5755605Z ) -> None: 2025-05-07T20:33:07.5755965Z torch.manual_seed(2025) 2025-05-07T20:33:07.5756355Z 2025-05-07T20:33:07.5756789Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5757377Z 2025-05-07T20:33:07.5757702Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5758173Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5758684Z x = x_sign * x_clamp 2025-05-07T20:33:07.5759358Z x0 = x[:, :D] 2025-05-07T20:33:07.5759717Z x1 = x[:, D:] 2025-05-07T20:33:07.5760059Z 2025-05-07T20:33:07.5760361Z if contiguous: 2025-05-07T20:33:07.5760729Z x0 = x0.contiguous() 2025-05-07T20:33:07.5761166Z x1 = x1.contiguous() 2025-05-07T20:33:07.5761573Z 2025-05-07T20:33:07.5761882Z if scale_ub is not None: 2025-05-07T20:33:07.5762349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5762916Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5763442Z ) 2025-05-07T20:33:07.5763749Z else: 2025-05-07T20:33:07.5764106Z scale_ub_tensor = None 2025-05-07T20:33:07.5764672Z 2025-05-07T20:33:07.5765052Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5765592Z op = silu_mul_quant 2025-05-07T20:33:07.5766024Z if compiled: 2025-05-07T20:33:07.5766435Z op = torch.compile(op) 2025-05-07T20:33:07.5766950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5767431Z 2025-05-07T20:33:07.5767744Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5768032Z 2025-05-07T20:33:07.5768200Z moe/activation_test.py:117: 2025-05-07T20:33:07.5768709Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5769278Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5769751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5770963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5772204Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5773115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5774313Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5775497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5776443Z kernel = self.compile( 2025-05-07T20:33:07.5777380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5778544Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5779220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5779618Z 2025-05-07T20:33:07.5779963Z self = 2025-05-07T20:33:07.5782033Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5784396Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e1a3040>} 2025-05-07T20:33:07.5786715Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5788510Z context = 2025-05-07T20:33:07.5789007Z 2025-05-07T20:33:07.5789284Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5790189Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5791008Z module_map=module_map) 2025-05-07T20:33:07.5791617Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5792208Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5792639Z E ^ 2025-05-07T20:33:07.5793521Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5794362Z 2025-05-07T20:33:07.5795099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5796019Z 2025-05-07T20:33:07.5796193Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5796895Z self=, 2025-05-07T20:33:07.5797588Z T=128, 2025-05-07T20:33:07.5797889Z D=5120, 2025-05-07T20:33:07.5798213Z scale_ub=None, 2025-05-07T20:33:07.5798622Z contiguous=True, 2025-05-07T20:33:07.5798982Z compiled=False, 2025-05-07T20:33:07.5799412Z ) 2025-05-07T20:33:07.5799952Z self = 2025-05-07T20:33:07.5800761Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.5801230Z 2025-05-07T20:33:07.5801358Z @given( 2025-05-07T20:33:07.5801743Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5802268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5802791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5803357Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5803919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5804398Z ) 2025-05-07T20:33:07.5804994Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5805764Z def test_silu_mul_quant( 2025-05-07T20:33:07.5806164Z self, 2025-05-07T20:33:07.5806485Z T: int, 2025-05-07T20:33:07.5806816Z D: int, 2025-05-07T20:33:07.5807167Z scale_ub: Optional[float], 2025-05-07T20:33:07.5807621Z contiguous: bool, 2025-05-07T20:33:07.5808025Z compiled: bool, 2025-05-07T20:33:07.5808403Z ) -> None: 2025-05-07T20:33:07.5808790Z torch.manual_seed(2025) 2025-05-07T20:33:07.5809197Z 2025-05-07T20:33:07.5809650Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5810228Z 2025-05-07T20:33:07.5810546Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5811018Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5811542Z x = x_sign * x_clamp 2025-05-07T20:33:07.5811941Z x0 = x[:, :D] 2025-05-07T20:33:07.5812285Z x1 = x[:, D:] 2025-05-07T20:33:07.5812625Z 2025-05-07T20:33:07.5812922Z if contiguous: 2025-05-07T20:33:07.5813282Z x0 = x0.contiguous() 2025-05-07T20:33:07.5813712Z x1 = x1.contiguous() 2025-05-07T20:33:07.5814107Z 2025-05-07T20:33:07.5814499Z if scale_ub is not None: 2025-05-07T20:33:07.5814962Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5815526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5816040Z ) 2025-05-07T20:33:07.5816355Z else: 2025-05-07T20:33:07.5816693Z scale_ub_tensor = None 2025-05-07T20:33:07.5817124Z 2025-05-07T20:33:07.5817492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5818021Z op = silu_mul_quant 2025-05-07T20:33:07.5818472Z if compiled: 2025-05-07T20:33:07.5818883Z op = torch.compile(op) 2025-05-07T20:33:07.5819382Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5819844Z 2025-05-07T20:33:07.5820155Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5820438Z 2025-05-07T20:33:07.5820597Z moe/activation_test.py:117: 2025-05-07T20:33:07.5821196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5821763Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5822231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5823430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5824803Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5825677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5826852Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5828015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5828939Z kernel = self.compile( 2025-05-07T20:33:07.5829889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5831051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5831819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5832208Z 2025-05-07T20:33:07.5832550Z self = 2025-05-07T20:33:07.5834477Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5836947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e1a3a60>} 2025-05-07T20:33:07.5839328Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5841956Z context = 2025-05-07T20:33:07.5842411Z 2025-05-07T20:33:07.5842668Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5843472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5844281Z module_map=module_map) 2025-05-07T20:33:07.5844883Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5845475Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5845912Z E ^ 2025-05-07T20:33:07.5846711Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5847512Z 2025-05-07T20:33:07.5848246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5849217Z 2025-05-07T20:33:07.5849387Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5850234Z self=, 2025-05-07T20:33:07.5850930Z T=128, 2025-05-07T20:33:07.5851237Z D=7168, 2025-05-07T20:33:07.5851552Z scale_ub=None, 2025-05-07T20:33:07.5851902Z contiguous=True, 2025-05-07T20:33:07.5852257Z compiled=False, 2025-05-07T20:33:07.5852602Z ) 2025-05-07T20:33:07.6743490Z self = 2025-05-07T20:33:07.6748198Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.6748587Z 2025-05-07T20:33:07.6748702Z @given( 2025-05-07T20:33:07.6749048Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6749520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6749997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6750525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6751033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6751489Z ) 2025-05-07T20:33:07.6752032Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6752745Z def test_silu_mul_quant( 2025-05-07T20:33:07.6753125Z self, 2025-05-07T20:33:07.6753439Z T: int, 2025-05-07T20:33:07.6753763Z D: int, 2025-05-07T20:33:07.6754315Z scale_ub: Optional[float], 2025-05-07T20:33:07.6754788Z contiguous: bool, 2025-05-07T20:33:07.6755197Z compiled: bool, 2025-05-07T20:33:07.6755577Z ) -> None: 2025-05-07T20:33:07.6755940Z torch.manual_seed(2025) 2025-05-07T20:33:07.6756338Z 2025-05-07T20:33:07.6756783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6757329Z 2025-05-07T20:33:07.6757636Z x_sign = torch.sign(x) 2025-05-07T20:33:07.6758122Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.6758708Z x = x_sign * x_clamp 2025-05-07T20:33:07.6759270Z x0 = x[:, :D] 2025-05-07T20:33:07.6759605Z x1 = x[:, D:] 2025-05-07T20:33:07.6759943Z 2025-05-07T20:33:07.6760248Z if contiguous: 2025-05-07T20:33:07.6760614Z x0 = x0.contiguous() 2025-05-07T20:33:07.6761041Z x1 = x1.contiguous() 2025-05-07T20:33:07.6761438Z 2025-05-07T20:33:07.6761748Z if scale_ub is not None: 2025-05-07T20:33:07.6762212Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.6762774Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.6763301Z ) 2025-05-07T20:33:07.6763609Z else: 2025-05-07T20:33:07.6763954Z scale_ub_tensor = None 2025-05-07T20:33:07.6764372Z 2025-05-07T20:33:07.6764743Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.6765278Z op = silu_mul_quant 2025-05-07T20:33:07.6765699Z if compiled: 2025-05-07T20:33:07.6766095Z op = torch.compile(op) 2025-05-07T20:33:07.6766609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.6767077Z 2025-05-07T20:33:07.6767387Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.6767677Z 2025-05-07T20:33:07.6767841Z moe/activation_test.py:117: 2025-05-07T20:33:07.6768342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.6768912Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.6769376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.6770597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.6771819Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.6772735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.6773911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.6775212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.6776166Z kernel = self.compile( 2025-05-07T20:33:07.6777103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.6778258Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.6778944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.6779463Z 2025-05-07T20:33:07.6779810Z self = 2025-05-07T20:33:07.6781788Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.6784141Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e29f790>} 2025-05-07T20:33:07.6786467Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.6788297Z context = 2025-05-07T20:33:07.6788795Z 2025-05-07T20:33:07.6789068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.6789970Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.6790779Z module_map=module_map) 2025-05-07T20:33:07.6791387Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.6791968Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.6792399Z E ^ 2025-05-07T20:33:07.6793208Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.6794044Z 2025-05-07T20:33:07.6794754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.6795671Z 2025-05-07T20:33:07.6795845Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6796552Z self=, 2025-05-07T20:33:07.6797237Z T=2048, 2025-05-07T20:33:07.6797542Z D=7168, 2025-05-07T20:33:07.6797858Z scale_ub=1200.0, 2025-05-07T20:33:07.6798222Z contiguous=True, 2025-05-07T20:33:07.6798619Z compiled=False, 2025-05-07T20:33:07.6798973Z ) 2025-05-07T20:33:07.6799509Z self = 2025-05-07T20:33:07.6800338Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.6800811Z 2025-05-07T20:33:07.6800937Z @given( 2025-05-07T20:33:07.6801309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6801831Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6802351Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6802909Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6803467Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6803951Z ) 2025-05-07T20:33:07.6804544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6805308Z def test_silu_mul_quant( 2025-05-07T20:33:07.6805700Z self, 2025-05-07T20:33:07.6806021Z T: int, 2025-05-07T20:33:07.6806345Z D: int, 2025-05-07T20:33:07.6806685Z scale_ub: Optional[float], 2025-05-07T20:33:07.6807137Z contiguous: bool, 2025-05-07T20:33:07.6807529Z compiled: bool, 2025-05-07T20:33:07.6807891Z ) -> None: 2025-05-07T20:33:07.6808251Z torch.manual_seed(2025) 2025-05-07T20:33:07.6808695Z 2025-05-07T20:33:07.6809219Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6812907Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.6816241Z 2025-05-07T20:33:07.6816444Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.6816820Z 2025-05-07T20:33:07.6816988Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6817692Z self=, 2025-05-07T20:33:07.6818375Z T=1, 2025-05-07T20:33:07.6818686Z D=5120, 2025-05-07T20:33:07.6818997Z scale_ub=1200.0, 2025-05-07T20:33:07.6819353Z contiguous=True, 2025-05-07T20:33:07.6819718Z compiled=False, 2025-05-07T20:33:07.6820058Z ) 2025-05-07T20:33:07.7297110Z self = 2025-05-07T20:33:07.7298220Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.7298660Z 2025-05-07T20:33:07.7298796Z @given( 2025-05-07T20:33:07.7299159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.7299676Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.7300194Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.7300753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.7301434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.7301928Z ) 2025-05-07T20:33:07.7302529Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.7303474Z def test_silu_mul_quant( 2025-05-07T20:33:07.7303895Z self, 2025-05-07T20:33:07.7304211Z T: int, 2025-05-07T20:33:07.7304545Z D: int, 2025-05-07T20:33:07.7304911Z scale_ub: Optional[float], 2025-05-07T20:33:07.7305366Z contiguous: bool, 2025-05-07T20:33:07.7305782Z compiled: bool, 2025-05-07T20:33:07.7306167Z ) -> None: 2025-05-07T20:33:07.7306532Z torch.manual_seed(2025) 2025-05-07T20:33:07.7306942Z 2025-05-07T20:33:07.7307402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.7307991Z 2025-05-07T20:33:07.7308310Z x_sign = torch.sign(x) 2025-05-07T20:33:07.7308818Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.7309362Z x = x_sign * x_clamp 2025-05-07T20:33:07.7309763Z x0 = x[:, :D] 2025-05-07T20:33:07.7310128Z x1 = x[:, D:] 2025-05-07T20:33:07.7310478Z 2025-05-07T20:33:07.7310798Z if contiguous: 2025-05-07T20:33:07.7311199Z x0 = x0.contiguous() 2025-05-07T20:33:07.7311643Z x1 = x1.contiguous() 2025-05-07T20:33:07.7312049Z 2025-05-07T20:33:07.7312379Z if scale_ub is not None: 2025-05-07T20:33:07.7312852Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.7313420Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.7313953Z ) 2025-05-07T20:33:07.7314284Z else: 2025-05-07T20:33:07.7314645Z scale_ub_tensor = None 2025-05-07T20:33:07.7315068Z 2025-05-07T20:33:07.7315462Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.7316011Z op = silu_mul_quant 2025-05-07T20:33:07.7316439Z if compiled: 2025-05-07T20:33:07.7316865Z op = torch.compile(op) 2025-05-07T20:33:07.7317379Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.7317848Z 2025-05-07T20:33:07.7318183Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.7318621Z 2025-05-07T20:33:07.7318814Z moe/activation_test.py:117: 2025-05-07T20:33:07.7319314Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.7319897Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.7320393Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.7321596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.7322933Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.7323877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.7325066Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.7326223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.7327159Z kernel = self.compile( 2025-05-07T20:33:07.7328119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.7329264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.7329940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.7330448Z 2025-05-07T20:33:07.7330803Z self = 2025-05-07T20:33:07.7332696Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.7335153Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e23f040>} 2025-05-07T20:33:07.7337524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.7339411Z context = 2025-05-07T20:33:07.7339920Z 2025-05-07T20:33:07.7340665Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.7341635Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.7342435Z module_map=module_map) 2025-05-07T20:33:07.7343050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.7343646Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.7344087Z E ^ 2025-05-07T20:33:07.7344866Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.7345667Z 2025-05-07T20:33:07.7346398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.7347303Z 2025-05-07T20:33:07.7347488Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.7348188Z self=, 2025-05-07T20:33:07.7348931Z T=2048, 2025-05-07T20:33:07.7349247Z D=5120, 2025-05-07T20:33:07.7349568Z scale_ub=None, 2025-05-07T20:33:07.7349917Z contiguous=True, 2025-05-07T20:33:07.7350294Z compiled=False, 2025-05-07T20:33:07.7350642Z ) 2025-05-07T20:33:07.7351165Z self = 2025-05-07T20:33:07.7351999Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.7352464Z 2025-05-07T20:33:07.7352601Z @given( 2025-05-07T20:33:07.7352974Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.7353504Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.7354169Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.7354740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.7355308Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.7355798Z ) 2025-05-07T20:33:07.7356393Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.7357160Z def test_silu_mul_quant( 2025-05-07T20:33:07.7369534Z self, 2025-05-07T20:33:07.7369886Z T: int, 2025-05-07T20:33:07.7370390Z D: int, 2025-05-07T20:33:07.7370752Z scale_ub: Optional[float], 2025-05-07T20:33:07.7371229Z contiguous: bool, 2025-05-07T20:33:07.7371638Z compiled: bool, 2025-05-07T20:33:07.7372012Z ) -> None: 2025-05-07T20:33:07.7372388Z torch.manual_seed(2025) 2025-05-07T20:33:07.7372806Z 2025-05-07T20:33:07.7373274Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.7373864Z 2025-05-07T20:33:07.7374211Z > x_sign = torch.sign(x) 2025-05-07T20:33:07.7377786Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.7381182Z 2025-05-07T20:33:07.7381403Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:07.7381770Z 2025-05-07T20:33:07.7381939Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.7382653Z self=, 2025-05-07T20:33:07.7383353Z T=16384, 2025-05-07T20:33:07.7383685Z D=5120, 2025-05-07T20:33:07.7384150Z scale_ub=None, 2025-05-07T20:33:07.7384510Z contiguous=True, 2025-05-07T20:33:07.7384900Z compiled=False, 2025-05-07T20:33:07.7385240Z ) 2025-05-07T20:33:07.7385782Z self = 2025-05-07T20:33:07.7386652Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.7387132Z 2025-05-07T20:33:07.7387262Z @given( 2025-05-07T20:33:07.7387654Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.7388193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.7388711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.7389284Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.7389855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.7390353Z ) 2025-05-07T20:33:07.7390950Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.7391733Z def test_silu_mul_quant( 2025-05-07T20:33:07.7392151Z self, 2025-05-07T20:33:07.7392470Z T: int, 2025-05-07T20:33:07.7392803Z D: int, 2025-05-07T20:33:07.7393174Z scale_ub: Optional[float], 2025-05-07T20:33:07.7393633Z contiguous: bool, 2025-05-07T20:33:07.7394043Z compiled: bool, 2025-05-07T20:33:07.7394429Z ) -> None: 2025-05-07T20:33:07.7394785Z torch.manual_seed(2025) 2025-05-07T20:33:07.7395196Z 2025-05-07T20:33:07.7395664Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.7399311Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.7402605Z 2025-05-07T20:33:07.7402836Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.7403205Z 2025-05-07T20:33:07.7403389Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.7404106Z self=, 2025-05-07T20:33:07.7404907Z T=4096, 2025-05-07T20:33:07.7405221Z D=5120, 2025-05-07T20:33:07.7405548Z scale_ub=None, 2025-05-07T20:33:07.7405913Z contiguous=True, 2025-05-07T20:33:07.7406284Z compiled=False, 2025-05-07T20:33:07.7406638Z ) 2025-05-07T20:33:07.8423927Z self = 2025-05-07T20:33:07.8424819Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.8425287Z 2025-05-07T20:33:07.8425414Z @given( 2025-05-07T20:33:07.8425816Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8426317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8426828Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8427388Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8428167Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8428665Z ) 2025-05-07T20:33:07.8429280Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8430064Z def test_silu_mul_quant( 2025-05-07T20:33:07.8430472Z self, 2025-05-07T20:33:07.8430807Z T: int, 2025-05-07T20:33:07.8431146Z D: int, 2025-05-07T20:33:07.8431508Z scale_ub: Optional[float], 2025-05-07T20:33:07.8431975Z contiguous: bool, 2025-05-07T20:33:07.8432379Z compiled: bool, 2025-05-07T20:33:07.8432752Z ) -> None: 2025-05-07T20:33:07.8433128Z torch.manual_seed(2025) 2025-05-07T20:33:07.8433683Z 2025-05-07T20:33:07.8434129Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8437771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8441353Z 2025-05-07T20:33:07.8441557Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.8441941Z 2025-05-07T20:33:07.8442113Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8442823Z self=, 2025-05-07T20:33:07.8443523Z T=2048, 2025-05-07T20:33:07.8443833Z D=5120, 2025-05-07T20:33:07.8444153Z scale_ub=None, 2025-05-07T20:33:07.8444505Z contiguous=False, 2025-05-07T20:33:07.8444881Z compiled=False, 2025-05-07T20:33:07.8445230Z ) 2025-05-07T20:33:07.8445760Z self = 2025-05-07T20:33:07.8446615Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.8447094Z 2025-05-07T20:33:07.8447218Z @given( 2025-05-07T20:33:07.8447599Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8448119Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8448649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8449213Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8449758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8450245Z ) 2025-05-07T20:33:07.8451000Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8451769Z def test_silu_mul_quant( 2025-05-07T20:33:07.8452180Z self, 2025-05-07T20:33:07.8452504Z T: int, 2025-05-07T20:33:07.8452829Z D: int, 2025-05-07T20:33:07.8453197Z scale_ub: Optional[float], 2025-05-07T20:33:07.8453660Z contiguous: bool, 2025-05-07T20:33:07.8454061Z compiled: bool, 2025-05-07T20:33:07.8454432Z ) -> None: 2025-05-07T20:33:07.8454780Z torch.manual_seed(2025) 2025-05-07T20:33:07.8455324Z 2025-05-07T20:33:07.8455774Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8459434Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
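A note on the log volume itself: @settings(verbosity=Verbosity.verbose) makes Hypothesis print every generated example ("Trying example: ...") together with its outcome, and deadline=None disables the per-example time budget, which is why the same test body is re-printed for each draw. A minimal sketch of the same decorator stack with normal verbosity; _MAX_SAMPLES is whatever the test module defines and is not shown in this log, so a literal stands in for it here:

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(verbosity=Verbosity.normal, max_examples=10, deadline=None)
    def test_shapes_only(T: int) -> None:
        # Placeholder body; a real test would exercise the kernel here.
        assert T >= 1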
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8462793Z 2025-05-07T20:33:07.8463003Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.8463478Z 2025-05-07T20:33:07.8463656Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8464357Z self=, 2025-05-07T20:33:07.8465045Z T=4096, 2025-05-07T20:33:07.8465343Z D=7168, 2025-05-07T20:33:07.8465656Z scale_ub=None, 2025-05-07T20:33:07.8466010Z contiguous=True, 2025-05-07T20:33:07.8466377Z compiled=True, 2025-05-07T20:33:07.8466701Z ) 2025-05-07T20:33:07.8467232Z self = 2025-05-07T20:33:07.8468072Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.8468653Z 2025-05-07T20:33:07.8468783Z @given( 2025-05-07T20:33:07.8469164Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8469682Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8470184Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8470744Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8471303Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8471778Z ) 2025-05-07T20:33:07.8472382Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8473147Z def test_silu_mul_quant( 2025-05-07T20:33:07.8473550Z self, 2025-05-07T20:33:07.8473856Z T: int, 2025-05-07T20:33:07.8474177Z D: int, 2025-05-07T20:33:07.8474533Z scale_ub: Optional[float], 2025-05-07T20:33:07.8474976Z contiguous: bool, 2025-05-07T20:33:07.8475374Z compiled: bool, 2025-05-07T20:33:07.8475750Z ) -> None: 2025-05-07T20:33:07.8476098Z torch.manual_seed(2025) 2025-05-07T20:33:07.8476491Z 2025-05-07T20:33:07.8476933Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8480667Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8483988Z 2025-05-07T20:33:07.8484196Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.8484562Z 2025-05-07T20:33:07.8484733Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8485518Z self=, 2025-05-07T20:33:07.8486212Z T=2048, 2025-05-07T20:33:07.8486510Z D=5120, 2025-05-07T20:33:07.8486821Z scale_ub=1200.0, 2025-05-07T20:33:07.8487193Z contiguous=False, 2025-05-07T20:33:07.8487554Z compiled=False, 2025-05-07T20:33:07.8487891Z ) 2025-05-07T20:33:07.8488442Z self = 2025-05-07T20:33:07.8489259Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.8489823Z 2025-05-07T20:33:07.8489949Z @given( 2025-05-07T20:33:07.8490323Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8490851Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8491360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8491915Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8492474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8492962Z ) 2025-05-07T20:33:07.8493555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8494317Z def test_silu_mul_quant( 2025-05-07T20:33:07.8494709Z self, 2025-05-07T20:33:07.8495032Z T: int, 2025-05-07T20:33:07.8495354Z D: int, 2025-05-07T20:33:07.8495779Z scale_ub: Optional[float], 2025-05-07T20:33:07.8496237Z contiguous: bool, 2025-05-07T20:33:07.8496625Z compiled: bool, 2025-05-07T20:33:07.8496991Z ) -> None: 2025-05-07T20:33:07.8497346Z torch.manual_seed(2025) 2025-05-07T20:33:07.8497746Z 2025-05-07T20:33:07.8498192Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8501854Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8505365Z 2025-05-07T20:33:07.8505577Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.8505954Z 2025-05-07T20:33:07.8506121Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8506830Z self=, 2025-05-07T20:33:07.8507508Z T=4096, 2025-05-07T20:33:07.8507820Z D=7168, 2025-05-07T20:33:07.8508133Z scale_ub=1200.0, 2025-05-07T20:33:07.8508503Z contiguous=True, 2025-05-07T20:33:07.8508881Z compiled=False, 2025-05-07T20:33:07.8509221Z ) 2025-05-07T20:33:07.8509749Z self = 2025-05-07T20:33:07.8510600Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.8511072Z 2025-05-07T20:33:07.8511207Z @given( 2025-05-07T20:33:07.8511566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8512095Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8512607Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8513163Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8513711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8514196Z ) 2025-05-07T20:33:07.8514791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8515570Z def test_silu_mul_quant( 2025-05-07T20:33:07.8515975Z self, 2025-05-07T20:33:07.8516280Z T: int, 2025-05-07T20:33:07.8516590Z D: int, 2025-05-07T20:33:07.8516924Z scale_ub: Optional[float], 2025-05-07T20:33:07.8517370Z contiguous: bool, 2025-05-07T20:33:07.8517849Z compiled: bool, 2025-05-07T20:33:07.8518222Z ) -> None: 2025-05-07T20:33:07.8518573Z torch.manual_seed(2025) 2025-05-07T20:33:07.8518972Z 2025-05-07T20:33:07.8519418Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8523094Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8526533Z 2025-05-07T20:33:07.8526730Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.8527091Z 2025-05-07T20:33:07.8527274Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8527969Z self=, 2025-05-07T20:33:07.8528670Z T=16384, 2025-05-07T20:33:07.8528986Z D=7168, 2025-05-07T20:33:07.8529289Z scale_ub=None, 2025-05-07T20:33:07.8529647Z contiguous=False, 2025-05-07T20:33:07.8530086Z compiled=True, 2025-05-07T20:33:07.8530429Z ) 2025-05-07T20:33:07.9821830Z self = 2025-05-07T20:33:07.9822750Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.9823206Z 2025-05-07T20:33:07.9823345Z @given( 2025-05-07T20:33:07.9823716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9824251Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9824768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9825324Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9826016Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9826445Z ) 2025-05-07T20:33:07.9826988Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9827687Z def test_silu_mul_quant( 2025-05-07T20:33:07.9828076Z self, 2025-05-07T20:33:07.9828383Z T: int, 2025-05-07T20:33:07.9828692Z D: int, 2025-05-07T20:33:07.9829041Z scale_ub: Optional[float], 2025-05-07T20:33:07.9829497Z contiguous: bool, 2025-05-07T20:33:07.9829889Z compiled: bool, 2025-05-07T20:33:07.9830276Z ) -> None: 2025-05-07T20:33:07.9830640Z torch.manual_seed(2025) 2025-05-07T20:33:07.9831042Z 2025-05-07T20:33:07.9831503Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9835187Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9838567Z 2025-05-07T20:33:07.9838773Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9839143Z 2025-05-07T20:33:07.9839324Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9840034Z self=, 2025-05-07T20:33:07.9841046Z T=4096, 2025-05-07T20:33:07.9841367Z D=7168, 2025-05-07T20:33:07.9841677Z scale_ub=None, 2025-05-07T20:33:07.9842032Z contiguous=True, 2025-05-07T20:33:07.9842406Z compiled=False, 2025-05-07T20:33:07.9842743Z ) 2025-05-07T20:33:07.9843441Z self = 2025-05-07T20:33:07.9844310Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9844787Z 2025-05-07T20:33:07.9844923Z @given( 2025-05-07T20:33:07.9845297Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9845835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9846357Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9846913Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9847616Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9848109Z ) 2025-05-07T20:33:07.9848707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9849483Z def test_silu_mul_quant( 2025-05-07T20:33:07.9849897Z self, 2025-05-07T20:33:07.9850218Z T: int, 2025-05-07T20:33:07.9850538Z D: int, 2025-05-07T20:33:07.9850901Z scale_ub: Optional[float], 2025-05-07T20:33:07.9851375Z contiguous: bool, 2025-05-07T20:33:07.9851771Z compiled: bool, 2025-05-07T20:33:07.9852151Z ) -> None: 2025-05-07T20:33:07.9852509Z torch.manual_seed(2025) 2025-05-07T20:33:07.9852918Z 2025-05-07T20:33:07.9853375Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9857050Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9860364Z 2025-05-07T20:33:07.9860587Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9861155Z 2025-05-07T20:33:07.9861338Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9862034Z self=, 2025-05-07T20:33:07.9862731Z T=16384, 2025-05-07T20:33:07.9863053Z D=7168, 2025-05-07T20:33:07.9863373Z scale_ub=None, 2025-05-07T20:33:07.9863730Z contiguous=True, 2025-05-07T20:33:07.9864112Z compiled=False, 2025-05-07T20:33:07.9864459Z ) 2025-05-07T20:33:07.9865000Z self = 2025-05-07T20:33:07.9865853Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9866340Z 2025-05-07T20:33:07.9866471Z @given( 2025-05-07T20:33:07.9866857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9867392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9867912Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9868487Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9869103Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9869587Z ) 2025-05-07T20:33:07.9870174Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9870938Z def test_silu_mul_quant( 2025-05-07T20:33:07.9871345Z self, 2025-05-07T20:33:07.9871665Z T: int, 2025-05-07T20:33:07.9872000Z D: int, 2025-05-07T20:33:07.9872369Z scale_ub: Optional[float], 2025-05-07T20:33:07.9872816Z contiguous: bool, 2025-05-07T20:33:07.9873226Z compiled: bool, 2025-05-07T20:33:07.9873603Z ) -> None: 2025-05-07T20:33:07.9873956Z torch.manual_seed(2025) 2025-05-07T20:33:07.9874368Z 2025-05-07T20:33:07.9874825Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9878366Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9881755Z 2025-05-07T20:33:07.9881980Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9882348Z 2025-05-07T20:33:07.9882522Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9883232Z self=, 2025-05-07T20:33:07.9883925Z T=16384, 2025-05-07T20:33:07.9884252Z D=7168, 2025-05-07T20:33:07.9884585Z scale_ub=1200.0, 2025-05-07T20:33:07.9884964Z contiguous=True, 2025-05-07T20:33:07.9885341Z compiled=False, 2025-05-07T20:33:07.9885705Z ) 2025-05-07T20:33:07.9886245Z self = 2025-05-07T20:33:07.9887103Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.9887585Z 2025-05-07T20:33:07.9887717Z @given( 2025-05-07T20:33:07.9888192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9888741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9889253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9889822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9890398Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9890882Z ) 2025-05-07T20:33:07.9891476Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9892249Z def test_silu_mul_quant( 2025-05-07T20:33:07.9892663Z self, 2025-05-07T20:33:07.9892990Z T: int, 2025-05-07T20:33:07.9893423Z D: int, 2025-05-07T20:33:07.9893800Z scale_ub: Optional[float], 2025-05-07T20:33:07.9894256Z contiguous: bool, 2025-05-07T20:33:07.9894667Z compiled: bool, 2025-05-07T20:33:07.9895035Z ) -> None: 2025-05-07T20:33:07.9895383Z torch.manual_seed(2025) 2025-05-07T20:33:07.9895785Z 2025-05-07T20:33:07.9896240Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9899836Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
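Note that the "free" figure reported by the allocator never recovers between examples, and later in this run it drops from 26.44 MiB to 4.44 MiB, which suggests allocations accumulating across Hypothesis examples. One possible mitigation, assuming the growth is cached allocator state rather than live references (a sketch; the helper name is ours, not part of the test file):

    import gc

    import torch

    def _release_cuda_memory() -> None:
        # Drop dead Python references, then hand cached allocator blocks
        # back to the CUDA driver so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()

Because @given runs every example inside a single unittest method, tearDown fires only once per test, so the call would have to go at the top of the test body itself.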
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9903197Z 2025-05-07T20:33:07.9903413Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9903776Z 2025-05-07T20:33:07.9903948Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9904645Z self=, 2025-05-07T20:33:07.9905334Z T=128, 2025-05-07T20:33:07.9905644Z D=5120, 2025-05-07T20:33:07.9905964Z scale_ub=1200.0, 2025-05-07T20:33:07.9906338Z contiguous=False, 2025-05-07T20:33:07.9906705Z compiled=False, 2025-05-07T20:33:07.9907052Z ) 2025-05-07T20:33:08.1511063Z self = 2025-05-07T20:33:08.1511978Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.1512438Z 2025-05-07T20:33:08.1512572Z @given( 2025-05-07T20:33:08.1512944Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.1513447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.1514390Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.1514959Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.1515469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.1515876Z ) 2025-05-07T20:33:08.1516370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.1516995Z def test_silu_mul_quant( 2025-05-07T20:33:08.1517339Z self, 2025-05-07T20:33:08.1517785Z T: int, 2025-05-07T20:33:08.1518066Z D: int, 2025-05-07T20:33:08.1518382Z scale_ub: Optional[float], 2025-05-07T20:33:08.1518765Z contiguous: bool, 2025-05-07T20:33:08.1519107Z compiled: bool, 2025-05-07T20:33:08.1519442Z ) -> None: 2025-05-07T20:33:08.1519750Z torch.manual_seed(2025) 2025-05-07T20:33:08.1520088Z 2025-05-07T20:33:08.1520480Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.1520983Z 2025-05-07T20:33:08.1521264Z x_sign = torch.sign(x) 2025-05-07T20:33:08.1521689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.1522139Z x = x_sign * x_clamp 2025-05-07T20:33:08.1522486Z x0 = x[:, :D] 2025-05-07T20:33:08.1522813Z x1 = x[:, D:] 2025-05-07T20:33:08.1523140Z 2025-05-07T20:33:08.1535775Z if contiguous: 2025-05-07T20:33:08.1536182Z x0 = x0.contiguous() 2025-05-07T20:33:08.1536599Z x1 = x1.contiguous() 2025-05-07T20:33:08.1537008Z 2025-05-07T20:33:08.1537328Z if scale_ub is not None: 2025-05-07T20:33:08.1537771Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.1538321Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.1538837Z ) 2025-05-07T20:33:08.1539149Z else: 2025-05-07T20:33:08.1539499Z scale_ub_tensor = None 2025-05-07T20:33:08.1539926Z 2025-05-07T20:33:08.1540697Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.1541505Z op = silu_mul_quant 2025-05-07T20:33:08.1541927Z if compiled: 2025-05-07T20:33:08.1542324Z op = torch.compile(op) 2025-05-07T20:33:08.1542827Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.1543294Z 2025-05-07T20:33:08.1543612Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.1543907Z 2025-05-07T20:33:08.1544071Z moe/activation_test.py:117: 2025-05-07T20:33:08.1544584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.1545165Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.1545634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.1546847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.1548076Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.1548999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.1550204Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.1551357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.1552290Z kernel = self.compile( 2025-05-07T20:33:08.1553234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.1554387Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.1555052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.1555449Z 2025-05-07T20:33:08.1555808Z self = 2025-05-07T20:33:08.1557838Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.1560232Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e041ca0>} 2025-05-07T20:33:08.1562475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.1564340Z context = 2025-05-07T20:33:08.1564808Z 2025-05-07T20:33:08.1565076Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.1565955Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.1566752Z module_map=module_map) 2025-05-07T20:33:08.1567371Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.1567961Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.1568397Z E ^ 2025-05-07T20:33:08.1569237Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.1570041Z 2025-05-07T20:33:08.1570879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.1571768Z 2025-05-07T20:33:08.1571950Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.1572646Z self=, 2025-05-07T20:33:08.1573346Z T=2048, 2025-05-07T20:33:08.1573646Z D=7168, 2025-05-07T20:33:08.1573962Z scale_ub=None, 2025-05-07T20:33:08.1574323Z contiguous=False, 2025-05-07T20:33:08.1574688Z compiled=False, 2025-05-07T20:33:08.1575038Z ) 2025-05-07T20:33:08.1575576Z self = 2025-05-07T20:33:08.1576486Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.1576958Z 2025-05-07T20:33:08.1577084Z @given( 2025-05-07T20:33:08.1577472Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.1577997Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.1578506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.1579069Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.1579637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.1580119Z ) 2025-05-07T20:33:08.1580718Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.1581603Z def test_silu_mul_quant( 2025-05-07T20:33:08.1582002Z self, 2025-05-07T20:33:08.1582318Z T: int, 2025-05-07T20:33:08.1582630Z D: int, 2025-05-07T20:33:08.1582980Z scale_ub: Optional[float], 2025-05-07T20:33:08.1583434Z contiguous: bool, 2025-05-07T20:33:08.1583842Z compiled: bool, 2025-05-07T20:33:08.1584207Z ) -> None: 2025-05-07T20:33:08.1584558Z torch.manual_seed(2025) 2025-05-07T20:33:08.1584965Z 2025-05-07T20:33:08.1585411Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.1589080Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
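The Triton CompilationError above is a hardware limitation rather than a flaky failure: fp8e4nv corresponds to float8_e4m3fn, which requires compute capability 8.9 (Ada) or newer, while the A10G GPUs backing linux.g5.4xlarge report 8.6. A capability guard one could put in front of the fp8 paths (a sketch; the helper is hypothetical, not part of the test file):

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) needs compute capability >= (8, 9),
        # i.e. Ada/Hopper; the A10G here reports (8, 6), hence the
        # compile failures above.
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    # e.g. @unittest.skipUnless(_supports_fp8e4nv(), "needs sm_89+ for fp8e4nv")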
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.1592406Z 2025-05-07T20:33:08.1592608Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.1592986Z 2025-05-07T20:33:08.1593241Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.1593953Z self=, 2025-05-07T20:33:08.1594640Z T=128, 2025-05-07T20:33:08.1594943Z D=7168, 2025-05-07T20:33:08.1595256Z scale_ub=1200.0, 2025-05-07T20:33:08.1595629Z contiguous=True, 2025-05-07T20:33:08.1595984Z compiled=True, 2025-05-07T20:33:08.1596323Z ) 2025-05-07T20:33:08.2034638Z self = 2025-05-07T20:33:08.2035760Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:08.2036208Z 2025-05-07T20:33:08.2036351Z @given( 2025-05-07T20:33:08.2036677Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.2037137Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.2037611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.2038169Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.2038738Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.2039219Z ) 2025-05-07T20:33:08.2039822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.2040867Z def test_silu_mul_quant( 2025-05-07T20:33:08.2041248Z self, 2025-05-07T20:33:08.2041698Z T: int, 2025-05-07T20:33:08.2042002Z D: int, 2025-05-07T20:33:08.2042339Z scale_ub: Optional[float], 2025-05-07T20:33:08.2042788Z contiguous: bool, 2025-05-07T20:33:08.2043192Z compiled: bool, 2025-05-07T20:33:08.2043543Z ) -> None: 2025-05-07T20:33:08.2043879Z torch.manual_seed(2025) 2025-05-07T20:33:08.2044271Z 2025-05-07T20:33:08.2044729Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.2045318Z 2025-05-07T20:33:08.2045633Z x_sign = torch.sign(x) 2025-05-07T20:33:08.2046087Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.2046693Z x = x_sign * x_clamp 2025-05-07T20:33:08.2047039Z x0 = x[:, :D] 2025-05-07T20:33:08.2047362Z x1 = x[:, D:] 2025-05-07T20:33:08.2047678Z 2025-05-07T20:33:08.2047965Z if contiguous: 2025-05-07T20:33:08.2048326Z x0 = x0.contiguous() 2025-05-07T20:33:08.2048780Z x1 = x1.contiguous() 2025-05-07T20:33:08.2049168Z 2025-05-07T20:33:08.2049452Z if scale_ub is not None: 2025-05-07T20:33:08.2049853Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.2050325Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.2050774Z ) 2025-05-07T20:33:08.2051050Z else: 2025-05-07T20:33:08.2051347Z scale_ub_tensor = None 2025-05-07T20:33:08.2051720Z 2025-05-07T20:33:08.2052063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.2052536Z op = silu_mul_quant 2025-05-07T20:33:08.2052926Z if compiled: 2025-05-07T20:33:08.2053313Z op = torch.compile(op) 2025-05-07T20:33:08.2053775Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2054212Z 2025-05-07T20:33:08.2054503Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.2054739Z 2025-05-07T20:33:08.2054876Z moe/activation_test.py:117: 2025-05-07T20:33:08.2055308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2055798Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.2056214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2057070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.2057971Z return fn(*args, **kwargs) 2025-05-07T20:33:08.2058987Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.2060017Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.2060946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.2062163Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.2063196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.2064011Z kernel = self.compile( 2025-05-07T20:33:08.2064852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.2066019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.2066680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2067030Z 2025-05-07T20:33:08.2067327Z self = 2025-05-07T20:33:08.2069043Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.2071255Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158df3a0d0>} 2025-05-07T20:33:08.2073478Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.2075080Z context = 2025-05-07T20:33:08.2075538Z 2025-05-07T20:33:08.2075795Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.2076632Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.2077339Z module_map=module_map) 2025-05-07T20:33:08.2077939Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.2078493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.2078931Z E ^ 2025-05-07T20:33:08.2079665Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.2080404Z 2025-05-07T20:33:08.2081046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.2081866Z 2025-05-07T20:33:08.2082028Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.2082690Z self=, 2025-05-07T20:33:08.2083326Z T=128, 2025-05-07T20:33:08.2083612Z D=7168, 2025-05-07T20:33:08.2083910Z scale_ub=1200.0, 2025-05-07T20:33:08.2084230Z contiguous=True, 2025-05-07T20:33:08.2084562Z compiled=False, 2025-05-07T20:33:08.2084857Z ) 2025-05-07T20:33:08.2085308Z self = 2025-05-07T20:33:08.2086020Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:08.2086423Z 2025-05-07T20:33:08.2086528Z @given( 2025-05-07T20:33:08.2086847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.2087287Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.2087732Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.2088219Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.2088693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.2089110Z ) 2025-05-07T20:33:08.2089621Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.2090268Z def test_silu_mul_quant( 2025-05-07T20:33:08.2090617Z self, 2025-05-07T20:33:08.2090888Z T: int, 2025-05-07T20:33:08.2091162Z D: int, 2025-05-07T20:33:08.2091470Z scale_ub: Optional[float], 2025-05-07T20:33:08.2091967Z contiguous: bool, 2025-05-07T20:33:08.2092319Z compiled: bool, 2025-05-07T20:33:08.2092630Z ) -> None: 2025-05-07T20:33:08.2092935Z torch.manual_seed(2025) 2025-05-07T20:33:08.2093293Z 2025-05-07T20:33:08.2093679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.2094202Z 2025-05-07T20:33:08.2094469Z x_sign = torch.sign(x) 2025-05-07T20:33:08.2094892Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.2098076Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.2101101Z 2025-05-07T20:33:08.2101285Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:08.2101617Z 2025-05-07T20:33:08.2101780Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.2102484Z self=, 2025-05-07T20:33:08.2103131Z T=128, 2025-05-07T20:33:08.2103417Z D=5120, 2025-05-07T20:33:08.2103696Z scale_ub=1200.0, 2025-05-07T20:33:08.2104030Z contiguous=True, 2025-05-07T20:33:08.2104362Z compiled=True, 2025-05-07T20:33:08.2104664Z ) 2025-05-07T20:33:08.2105161Z self = 2025-05-07T20:33:08.2105923Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:08.2106323Z 2025-05-07T20:33:08.2106439Z @given( 2025-05-07T20:33:08.2106775Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.2107305Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.2107771Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.2108275Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.2108800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.2109262Z ) 2025-05-07T20:33:08.2109818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.2110486Z def test_silu_mul_quant( 2025-05-07T20:33:08.2110847Z self, 2025-05-07T20:33:08.2111132Z T: int, 2025-05-07T20:33:08.2111412Z D: int, 2025-05-07T20:33:08.2111734Z scale_ub: Optional[float], 2025-05-07T20:33:08.2112146Z contiguous: bool, 2025-05-07T20:33:08.2112502Z compiled: bool, 2025-05-07T20:33:08.2112854Z ) -> None: 2025-05-07T20:33:08.2113195Z torch.manual_seed(2025) 2025-05-07T20:33:08.2113577Z 2025-05-07T20:33:08.2114003Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.2114509Z 2025-05-07T20:33:08.2114782Z x_sign = torch.sign(x) 2025-05-07T20:33:08.2115228Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.2118354Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.2121243Z 2025-05-07T20:33:08.2121430Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:08.2121752Z 2025-05-07T20:33:08.2121971Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.2122614Z self=, 2025-05-07T20:33:08.2123241Z T=128, 2025-05-07T20:33:08.2123518Z D=7168, 2025-05-07T20:33:08.2123790Z scale_ub=None, 2025-05-07T20:33:08.2124108Z contiguous=True, 2025-05-07T20:33:08.2124440Z compiled=True, 2025-05-07T20:33:08.2124742Z ) 2025-05-07T20:33:08.4281295Z self = 2025-05-07T20:33:08.4282506Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:08.4282943Z 2025-05-07T20:33:08.4283074Z @given( 2025-05-07T20:33:08.4283431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4283876Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4284359Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4284904Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4285464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4285950Z ) 2025-05-07T20:33:08.4286500Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4287220Z def test_silu_mul_quant( 2025-05-07T20:33:08.4287625Z self, 2025-05-07T20:33:08.4287940Z T: int, 2025-05-07T20:33:08.4288419Z D: int, 2025-05-07T20:33:08.4288825Z scale_ub: Optional[float], 2025-05-07T20:33:08.4289298Z contiguous: bool, 2025-05-07T20:33:08.4289711Z compiled: bool, 2025-05-07T20:33:08.4290081Z ) -> None: 2025-05-07T20:33:08.4290429Z torch.manual_seed(2025) 2025-05-07T20:33:08.4290828Z 2025-05-07T20:33:08.4291266Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4294897Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.4298315Z 2025-05-07T20:33:08.4298513Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.4298865Z 2025-05-07T20:33:08.4307572Z FAILED 2025-05-07T20:33:08.4307766Z 2025-05-07T20:33:08.4307950Z =================================== FAILURES =================================== 2025-05-07T20:33:08.4308560Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:08.4309219Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:08.4310108Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:33:08.4310758Z | yield 2025-05-07T20:33:08.4311240Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:33:08.4311813Z | self._callTestMethod(testMethod) 2025-05-07T20:33:08.4312432Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:33:08.4313076Z | method() 2025-05-07T20:33:08.4313794Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:08.4314791Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4315648Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:08.4316284Z | raise the_error_hypothesis_found 2025-05-07T20:33:08.4316935Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:08.4317560Z +-+---------------- 1 ---------------- 2025-05-07T20:33:08.4317871Z | Traceback (most recent call last): 2025-05-07T20:33:08.4318825Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:08.4319923Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4322791Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.4325662Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:08.4326278Z | self=, 2025-05-07T20:33:08.4326840Z | T=2048, 2025-05-07T20:33:08.4327167Z | D=5120, # or any other generated value 2025-05-07T20:33:08.4327704Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:08.4328708Z | contiguous=True, # or any other generated value 2025-05-07T20:33:08.4329228Z | compiled=False, # or any other generated value 2025-05-07T20:33:08.4329655Z | ) 2025-05-07T20:33:08.4329905Z | 2025-05-07T20:33:08.4330638Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:08.4331479Z +---------------- 2 ---------------- 2025-05-07T20:33:08.4331890Z | Traceback (most recent call last): 2025-05-07T20:33:08.4332961Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:08.4334046Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4336898Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.4339682Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:08.4340584Z | self=, 2025-05-07T20:33:08.4341291Z | T=128, 2025-05-07T20:33:08.4341579Z | D=7168, 2025-05-07T20:33:08.4341863Z | scale_ub=None, 2025-05-07T20:33:08.4342201Z | contiguous=True, 2025-05-07T20:33:08.4342549Z | compiled=True, 2025-05-07T20:33:08.4342855Z | ) 2025-05-07T20:33:08.4343108Z | 2025-05-07T20:33:08.4343845Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:08.4344567Z +---------------- 3 ---------------- 2025-05-07T20:33:08.4344863Z | Traceback (most recent call last): 2025-05-07T20:33:08.4345582Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:08.4346367Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4364841Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
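Hypothesis's reproduction hint in the report above slots in as the outermost decorator on the test. A standalone sketch using the version and blob printed for failure 1 (the function name is ours and the body is elided; the blob only decodes against the exact strategy list shown in the log):

    from typing import Optional

    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # version + blob from failure 1
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant_repro(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # body identical to test_silu_mul_quant in moe/activation_test.py

The decorator is meant to be temporary: it pins the generator to the falsifying example while debugging and should be removed once the underlying failure is fixed.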
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.4367007Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:08.4367627Z | self=, 2025-05-07T20:33:08.4368207Z | T=128, 2025-05-07T20:33:08.4368504Z | D=5120, 2025-05-07T20:33:08.4368796Z | scale_ub=1200.0, 2025-05-07T20:33:08.4369150Z | contiguous=True, 2025-05-07T20:33:08.4369500Z | compiled=True, 2025-05-07T20:33:08.4369826Z | ) 2025-05-07T20:33:08.4370097Z | 2025-05-07T20:33:08.4370847Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:08.4371711Z +---------------- 4 ---------------- 2025-05-07T20:33:08.4372125Z | Traceback (most recent call last): 2025-05-07T20:33:08.4373219Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:08.4374237Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:08.4375154Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:08.4376148Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4377324Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:08.4378550Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.4379411Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:08.4380470Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4381658Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:08.4382758Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4383874Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:33:08.4384996Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4386089Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:08.4387068Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.4387985Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:08.4388771Z | fn() 2025-05-07T20:33:08.4389570Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:08.4390461Z | self.fn.run( 2025-05-07T20:33:08.4391217Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:08.4392055Z | kernel = self.compile( 2025-05-07T20:33:08.4392993Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:08.4393989Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4394992Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:08.4396124Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4396860Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4397417Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:08.4397803Z | ^ 2025-05-07T20:33:08.4398459Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4399244Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:08.4399820Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:08.4400548Z | self=, 2025-05-07T20:33:08.4401173Z | T=1, # or any other generated value 2025-05-07T20:33:08.4401614Z | D=5120, # or any other generated value 2025-05-07T20:33:08.4402104Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:08.4402639Z | contiguous=True, # or any other generated value 2025-05-07T20:33:08.4403205Z | compiled=True, # or any other generated value 2025-05-07T20:33:08.4403655Z | ) 2025-05-07T20:33:08.4403913Z | 2025-05-07T20:33:08.4404647Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:08.4405520Z +------------------------------------ 2025-05-07T20:33:08.4406040Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:08.4406580Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4407168Z self=, 2025-05-07T20:33:08.4407804Z T=1, 2025-05-07T20:33:08.4408075Z D=5120, 2025-05-07T20:33:08.4408339Z scale_ub=None, 2025-05-07T20:33:08.4408672Z contiguous=True, 2025-05-07T20:33:08.4409015Z compiled=True, 2025-05-07T20:33:08.4409321Z ) 2025-05-07T20:33:08.4409793Z self = 2025-05-07T20:33:08.4410478Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:08.4410850Z 2025-05-07T20:33:08.4410975Z @given( 2025-05-07T20:33:08.4411301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4411759Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4412209Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4412685Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4413167Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4413586Z ) 2025-05-07T20:33:08.4414083Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4414712Z def test_silu_mul_quant( 2025-05-07T20:33:08.4415063Z self, 2025-05-07T20:33:08.4415333Z T: int, 2025-05-07T20:33:08.4415601Z D: int, 2025-05-07T20:33:08.4415909Z scale_ub: Optional[float], 2025-05-07T20:33:08.4416281Z contiguous: bool, 2025-05-07T20:33:08.4416634Z compiled: bool, 2025-05-07T20:33:08.4416956Z ) -> None: 2025-05-07T20:33:08.4417271Z torch.manual_seed(2025) 2025-05-07T20:33:08.4417620Z 2025-05-07T20:33:08.4418010Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4418507Z 2025-05-07T20:33:08.4418784Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4419198Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4419651Z x = x_sign * x_clamp 2025-05-07T20:33:08.4419984Z x0 = x[:, :D] 2025-05-07T20:33:08.4420281Z x1 = x[:, D:] 2025-05-07T20:33:08.4420629Z 2025-05-07T20:33:08.4420885Z if contiguous: 2025-05-07T20:33:08.4421310Z x0 = x0.contiguous() 
2025-05-07T20:33:08.4421672Z x1 = x1.contiguous() 2025-05-07T20:33:08.4421998Z 2025-05-07T20:33:08.4422283Z if scale_ub is not None: 2025-05-07T20:33:08.4422672Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4423153Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4423601Z ) 2025-05-07T20:33:08.4423945Z else: 2025-05-07T20:33:08.4424257Z scale_ub_tensor = None 2025-05-07T20:33:08.4424620Z 2025-05-07T20:33:08.4424959Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4425423Z op = silu_mul_quant 2025-05-07T20:33:08.4425784Z if compiled: 2025-05-07T20:33:08.4426150Z op = torch.compile(op) 2025-05-07T20:33:08.4426578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4426976Z 2025-05-07T20:33:08.4427261Z y_fp8, y_scale = fn() 2025-05-07T20:33:08.4427676Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:08.4428086Z 2025-05-07T20:33:08.4428420Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4428874Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:08.4429337Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:08.4429778Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:08.4430286Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4430730Z 2025-05-07T20:33:08.4431020Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:08.4431294Z 2025-05-07T20:33:08.4431443Z moe/activation_test.py:126: 2025-05-07T20:33:08.4431862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4432345Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:08.4432817Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4433987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:08.4435053Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.4435830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4436789Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4437737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:08.4438725Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4439723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:08.4441074Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4442112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:08.4442994Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.4443784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:08.4444469Z fn() 2025-05-07T20:33:08.4445126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:08.4445893Z self.fn.run( 2025-05-07T20:33:08.4446509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4447196Z kernel = self.compile( 2025-05-07T20:33:08.4447901Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4448875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4449408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4449714Z 2025-05-07T20:33:08.4449984Z self = 2025-05-07T20:33:08.4451469Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4453452Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15925d89d0>} 2025-05-07T20:33:08.4455283Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4456677Z context = 2025-05-07T20:33:08.4457086Z 2025-05-07T20:33:08.4457328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4458074Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4458852Z module_map=module_map) 2025-05-07T20:33:08.4459365Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4459881Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:08.4460267Z E ^ 2025-05-07T20:33:08.4460911Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4461600Z 2025-05-07T20:33:08.4462147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4462823Z 2025-05-07T20:33:08.4462961Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4463600Z self=, 2025-05-07T20:33:08.4464124Z T=2048, 2025-05-07T20:33:08.4464375Z D=5120, 2025-05-07T20:33:08.4464636Z scale_ub=1200.0, 2025-05-07T20:33:08.4464925Z contiguous=True, 2025-05-07T20:33:08.4465221Z compiled=False, 2025-05-07T20:33:08.4465500Z ) 2025-05-07T20:33:08.4465914Z self = 2025-05-07T20:33:08.4466573Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:08.4466937Z 2025-05-07T20:33:08.4467041Z @given( 2025-05-07T20:33:08.4467343Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4467750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4468159Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4468603Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4469094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4469481Z ) 2025-05-07T20:33:08.4469944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4470521Z def test_silu_mul_quant( 2025-05-07T20:33:08.4470847Z self, 2025-05-07T20:33:08.4471109Z T: int, 2025-05-07T20:33:08.4471390Z D: int, 2025-05-07T20:33:08.4471689Z scale_ub: Optional[float], 2025-05-07T20:33:08.4472075Z contiguous: bool, 2025-05-07T20:33:08.4472420Z compiled: bool, 2025-05-07T20:33:08.4472720Z ) -> None: 2025-05-07T20:33:08.4473015Z torch.manual_seed(2025) 2025-05-07T20:33:08.4473347Z 2025-05-07T20:33:08.4473709Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4474175Z 2025-05-07T20:33:08.4474436Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4474827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4475254Z x = x_sign * x_clamp 2025-05-07T20:33:08.4475661Z x0 = x[:, :D] 
2025-05-07T20:33:08Z
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117
moe/activation_test.py:115: in fn -> return op(x0, x1, scale_ub_tensor)
fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant -> _fbgemm_silu_mul_quant[grid](
triton/runtime/jit.py:623: in run -> kernel = self.compile(
triton/compiler/compiler.py:273: in compile -> module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Hypothesis (verbosity=Verbosity.verbose) keeps trying further examples; each "Trying example:" block re-prints this same test body, and each example fails with this same ValueError while Triton compiles an FP8 kernel. The test body continues past fn() with a dequantize step and a reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

        y_fp8_ref, y_scale_ref = ref_fn()

The error surfaces through one of two call chains:

  at y_fp8, y_scale = fn()              (moe/activation_test.py:117)
      -> silu_mul_quant                 (gen_ai/moe/activation.py:80)
      -> _fbgemm_silu_mul_quant[grid]   -> Triton compile -> CompilationError (fp8e4nv)

  at y_fp8_ref, y_scale_ref = ref_fn()  (moe/activation_test.py:126, via ref_fn at :124)
      -> triton_quantize_fp8_row        (triton_gemm/fp8_gemm.py:2370)
      -> autotuner do_bench             (triton/runtime/autotuner.py:166/186)
      -> _kernel_quantize_fp8_row[grid] -> Triton compile -> CompilationError (fp8e4nv)
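Note: fp8e4nv is Triton's name for the dtype PyTorch calls torch.float8_e4m3fn, and Triton can only lower it on GPUs with compute capability 8.9 or newer (Ada/Hopper); on older parts it raises exactly this ValueError, offering only fp8e4b15 and fp8e5. The A10G on a g5.4xlarge runner reports capability (8, 6), so every FP8-E4M3 kernel in this test is expected to fail here. A minimal guard sketch, assuming a unittest-style suite; supports_fp8e4nv and ActivationFp8Tests are hypothetical names, not code from this repository:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) kernels need compute capability
        # >= (8, 9); the A10G on this runner reports (8, 6).
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class ActivationFp8Tests(unittest.TestCase):
        # test_silu_mul_quant and the other FP8 tests would live here
        # unchanged; a class-level skip keeps Hypothesis from burning
        # max_examples on a GPU that can never compile the kernels.
        ...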
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4588067Z 2025-05-07T20:33:08.4588489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4589007Z 2025-05-07T20:33:08.4589114Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4589545Z self=, 2025-05-07T20:33:08.4589955Z T=1, 2025-05-07T20:33:08.4590148Z D=7168, 2025-05-07T20:33:08.4590362Z scale_ub=None, 2025-05-07T20:33:08.4590590Z contiguous=True, 2025-05-07T20:33:08.4590823Z compiled=True, 2025-05-07T20:33:08.4591041Z ) 2025-05-07T20:33:08.4591380Z self = 2025-05-07T20:33:08.4591867Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:08.4592146Z 2025-05-07T20:33:08.4592229Z @given( 2025-05-07T20:33:08.4592477Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4592813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4593127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4593471Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4593819Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4594114Z ) 2025-05-07T20:33:08.4594484Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4594937Z def test_silu_mul_quant( 2025-05-07T20:33:08.4595240Z self, 2025-05-07T20:33:08.4595461Z T: int, 2025-05-07T20:33:08.4595676Z D: int, 2025-05-07T20:33:08.4595903Z scale_ub: Optional[float], 2025-05-07T20:33:08.4596196Z contiguous: bool, 2025-05-07T20:33:08.4596448Z compiled: bool, 2025-05-07T20:33:08.4596683Z ) -> None: 2025-05-07T20:33:08.4596907Z torch.manual_seed(2025) 2025-05-07T20:33:08.4597161Z 2025-05-07T20:33:08.4597449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4597845Z 2025-05-07T20:33:08.4598049Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4598353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4598675Z x = x_sign * x_clamp 2025-05-07T20:33:08.4598931Z x0 = x[:, :D] 2025-05-07T20:33:08.4599164Z x1 = x[:, D:] 2025-05-07T20:33:08.4599376Z 2025-05-07T20:33:08.4599578Z if contiguous: 2025-05-07T20:33:08.4599826Z x0 = x0.contiguous() 2025-05-07T20:33:08.4600096Z x1 = x1.contiguous() 2025-05-07T20:33:08.4600352Z 2025-05-07T20:33:08.4600556Z if scale_ub is not None: 2025-05-07T20:33:08.4600833Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4601182Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4601547Z ) 2025-05-07T20:33:08.4601756Z else: 2025-05-07T20:33:08.4601975Z scale_ub_tensor = None 2025-05-07T20:33:08.4602246Z 2025-05-07T20:33:08.4602492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4602820Z op = silu_mul_quant 2025-05-07T20:33:08.4603091Z if compiled: 2025-05-07T20:33:08.4603360Z op = torch.compile(op) 2025-05-07T20:33:08.4603667Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4603963Z 2025-05-07T20:33:08.4604173Z y_fp8, y_scale = fn() 2025-05-07T20:33:08.4604471Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:08.4604829Z 2025-05-07T20:33:08.4605080Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4605421Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:08.4605733Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:08.4606071Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:08.4606450Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4606767Z 2025-05-07T20:33:08.4606989Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:08.4607189Z 2025-05-07T20:33:08.4607304Z moe/activation_test.py:126: 2025-05-07T20:33:08.4607604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4607953Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:08.4608306Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4609108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:08.4609886Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.4610451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4611146Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4611837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:08.4612571Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4613334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:08.4614093Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4614872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:08.4615529Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.4616142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:08.4616660Z fn() 2025-05-07T20:33:08.4617183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:08.4617797Z self.fn.run( 2025-05-07T20:33:08.4618272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4618805Z kernel = self.compile( 2025-05-07T20:33:08.4619343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4620001Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4620411Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4620644Z 2025-05-07T20:33:08.4620859Z self = 2025-05-07T20:33:08.4622086Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4623488Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1590ffef70>} 2025-05-07T20:33:08.4624840Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4625865Z context = 2025-05-07T20:33:08.4626160Z 2025-05-07T20:33:08.4626387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4626909Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4627383Z module_map=module_map) 2025-05-07T20:33:08.4627775Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4628137Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:08.4628416Z E ^ 2025-05-07T20:33:08.4628953Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4629404Z 2025-05-07T20:33:08.4629829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4630337Z 2025-05-07T20:33:08.4630446Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4630870Z self=, 2025-05-07T20:33:08.4631288Z T=4096, 2025-05-07T20:33:08.4631486Z D=5120, 2025-05-07T20:33:08.4631689Z scale_ub=None, 2025-05-07T20:33:08.4631917Z contiguous=False, 2025-05-07T20:33:08.4632149Z compiled=False, 2025-05-07T20:33:08.4632367Z ) 2025-05-07T20:33:08.4632695Z self = 2025-05-07T20:33:08.4633199Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.4633487Z 2025-05-07T20:33:08.4633569Z @given( 2025-05-07T20:33:08.4633809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4634141Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4634455Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4634799Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4635138Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4635435Z ) 2025-05-07T20:33:08.4635877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4636343Z def test_silu_mul_quant( 2025-05-07T20:33:08.4636586Z self, 2025-05-07T20:33:08.4636792Z T: int, 2025-05-07T20:33:08.4637001Z D: int, 2025-05-07T20:33:08.4637224Z scale_ub: Optional[float], 2025-05-07T20:33:08.4637511Z contiguous: bool, 2025-05-07T20:33:08.4637759Z compiled: bool, 2025-05-07T20:33:08.4637993Z ) -> None: 2025-05-07T20:33:08.4638216Z torch.manual_seed(2025) 2025-05-07T20:33:08.4638522Z 2025-05-07T20:33:08.4638807Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4639150Z 2025-05-07T20:33:08.4639355Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4639657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4639973Z x = x_sign * x_clamp 2025-05-07T20:33:08.4640541Z x0 = x[:, :D] 2025-05-07T20:33:08.4640773Z x1 = x[:, D:] 2025-05-07T20:33:08.4640988Z 2025-05-07T20:33:08.4641189Z if contiguous: 2025-05-07T20:33:08.4641433Z x0 = x0.contiguous() 2025-05-07T20:33:08.4641697Z x1 = x1.contiguous() 2025-05-07T20:33:08.4641948Z 2025-05-07T20:33:08.4642153Z if scale_ub is not None: 2025-05-07T20:33:08.4642430Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4642911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4643237Z ) 2025-05-07T20:33:08.4643445Z else: 2025-05-07T20:33:08.4643660Z scale_ub_tensor = None 2025-05-07T20:33:08.4643922Z 2025-05-07T20:33:08.4644170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4644487Z op = silu_mul_quant 2025-05-07T20:33:08.4644748Z if compiled: 
2025-05-07T20:33:08.4645007Z op = torch.compile(op) 2025-05-07T20:33:08.4645306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4645590Z 2025-05-07T20:33:08.4645876Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4646061Z 2025-05-07T20:33:08.4646166Z moe/activation_test.py:117: 2025-05-07T20:33:08.4646472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4646814Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4647106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4647798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4648493Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4649037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4649727Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4650395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4650939Z kernel = self.compile( 2025-05-07T20:33:08.4651479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4652137Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4652543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4652775Z 2025-05-07T20:33:08.4652992Z self = 2025-05-07T20:33:08.4654069Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4655435Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1590c1fca0>} 2025-05-07T20:33:08.4656839Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4657882Z context = 2025-05-07T20:33:08.4658170Z 2025-05-07T20:33:08.4658346Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4658884Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4659418Z module_map=module_map) 2025-05-07T20:33:08.4659797Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4660156Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4660427Z E ^ 2025-05-07T20:33:08.4660895Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4661450Z 2025-05-07T20:33:08.4661886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4662393Z 2025-05-07T20:33:08.4662501Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4662919Z self=, 2025-05-07T20:33:08.4663376Z T=4096, 2025-05-07T20:33:08.4663573Z D=7168, 2025-05-07T20:33:08.4663774Z scale_ub=None, 2025-05-07T20:33:08.4664002Z contiguous=False, 2025-05-07T20:33:08.4664234Z compiled=False, 2025-05-07T20:33:08.4664448Z ) 2025-05-07T20:33:08.4664773Z self = 2025-05-07T20:33:08.4665271Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.4665557Z 2025-05-07T20:33:08.4665640Z @given( 2025-05-07T20:33:08.4665879Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4666205Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4666570Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4666909Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4667245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4667533Z ) 2025-05-07T20:33:08.4667897Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4668344Z def test_silu_mul_quant( 2025-05-07T20:33:08.4668589Z self, 2025-05-07T20:33:08.4668799Z T: int, 2025-05-07T20:33:08.4669010Z D: int, 2025-05-07T20:33:08.4669234Z scale_ub: Optional[float], 2025-05-07T20:33:08.4669516Z contiguous: bool, 2025-05-07T20:33:08.4669769Z compiled: bool, 2025-05-07T20:33:08.4670002Z ) -> None: 2025-05-07T20:33:08.4670223Z torch.manual_seed(2025) 2025-05-07T20:33:08.4670479Z 2025-05-07T20:33:08.4670759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4671107Z 2025-05-07T20:33:08.4671315Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4671621Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4671937Z x = x_sign * x_clamp 2025-05-07T20:33:08.4672189Z x0 = x[:, :D] 2025-05-07T20:33:08.4672420Z x1 = x[:, D:] 2025-05-07T20:33:08.4672631Z 2025-05-07T20:33:08.4672831Z if contiguous: 2025-05-07T20:33:08.4673075Z x0 = x0.contiguous() 2025-05-07T20:33:08.4673337Z x1 = x1.contiguous() 2025-05-07T20:33:08.4673594Z 2025-05-07T20:33:08.4673795Z if scale_ub is not None: 2025-05-07T20:33:08.4674073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4674418Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4674735Z ) 2025-05-07T20:33:08.4674931Z else: 2025-05-07T20:33:08.4675151Z scale_ub_tensor = None 2025-05-07T20:33:08.4675417Z 2025-05-07T20:33:08.4675708Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4676035Z op = silu_mul_quant 2025-05-07T20:33:08.4676297Z if compiled: 2025-05-07T20:33:08.4676557Z op = torch.compile(op) 2025-05-07T20:33:08.4676856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4677138Z 2025-05-07T20:33:08.4677343Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4677513Z 2025-05-07T20:33:08.4677616Z moe/activation_test.py:117: 2025-05-07T20:33:08.4677972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4678315Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4678616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4679341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4680034Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4680581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4681266Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4681929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4682464Z kernel = self.compile( 2025-05-07T20:33:08.4683053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4683723Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4684127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4684360Z 2025-05-07T20:33:08.4684580Z self = 2025-05-07T20:33:08.4685659Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4687089Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1590c7a700>} 2025-05-07T20:33:08.4688432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4689454Z context = 2025-05-07T20:33:08.4689743Z 2025-05-07T20:33:08.4689922Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4690446Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4690922Z module_map=module_map) 2025-05-07T20:33:08.4691304Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4691665Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4691939Z E ^ 2025-05-07T20:33:08.4692412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4692858Z 2025-05-07T20:33:08.4693281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4693791Z 2025-05-07T20:33:08.4693899Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4694321Z self=, 2025-05-07T20:33:08.4694734Z T=128, 2025-05-07T20:33:08.4694941Z D=7168, 2025-05-07T20:33:08.4695137Z scale_ub=None, 2025-05-07T20:33:08.4695362Z contiguous=False, 2025-05-07T20:33:08.4695603Z compiled=True, 2025-05-07T20:33:08.4695809Z ) 2025-05-07T20:33:08.4696139Z self = 2025-05-07T20:33:08.4696685Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.4696958Z 2025-05-07T20:33:08.4697038Z @given( 2025-05-07T20:33:08.4697280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4697603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4697917Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4698256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4698678Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4698993Z ) 2025-05-07T20:33:08.4699346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4699796Z def test_silu_mul_quant( 2025-05-07T20:33:08.4700048Z self, 2025-05-07T20:33:08.4700246Z T: int, 2025-05-07T20:33:08.4700456Z D: int, 2025-05-07T20:33:08.4700685Z scale_ub: Optional[float], 2025-05-07T20:33:08.4700962Z contiguous: bool, 2025-05-07T20:33:08.4701355Z compiled: bool, 2025-05-07T20:33:08.4701591Z ) -> None: 2025-05-07T20:33:08.4701813Z torch.manual_seed(2025) 2025-05-07T20:33:08.4702065Z 2025-05-07T20:33:08.4702350Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4702694Z 2025-05-07T20:33:08.4702991Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4703300Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4703617Z x = x_sign * x_clamp 2025-05-07T20:33:08.4703873Z x0 = x[:, :D] 2025-05-07T20:33:08.4704102Z x1 = x[:, D:] 2025-05-07T20:33:08.4704321Z 2025-05-07T20:33:08.4704511Z if contiguous: 2025-05-07T20:33:08.4704753Z x0 = x0.contiguous() 2025-05-07T20:33:08.4705024Z x1 = x1.contiguous() 2025-05-07T20:33:08.4705272Z 2025-05-07T20:33:08.4705477Z if scale_ub is not None: 2025-05-07T20:33:08.4705763Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4706182Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4706504Z ) 2025-05-07T20:33:08.4706708Z else: 2025-05-07T20:33:08.4706924Z scale_ub_tensor = None 2025-05-07T20:33:08.4707191Z 2025-05-07T20:33:08.4707439Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4707760Z op = silu_mul_quant 2025-05-07T20:33:08.4708024Z if compiled: 2025-05-07T20:33:08.4708289Z op = torch.compile(op) 2025-05-07T20:33:08.4708594Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4708877Z 2025-05-07T20:33:08.4709084Z y_fp8, y_scale = fn() 2025-05-07T20:33:08.4709375Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:08.4709677Z 2025-05-07T20:33:08.4709925Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4710273Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:08.4710577Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:08.4710902Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:08.4711269Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4711581Z 2025-05-07T20:33:08.4711795Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:08.4711996Z 2025-05-07T20:33:08.4712111Z moe/activation_test.py:126: 2025-05-07T20:33:08.4712410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4712759Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:08.4713102Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4713890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:08.4714638Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.4715237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4715937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4716627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:08.4717357Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4718117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:08.4718916Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4719647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:08.4720290Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.4720899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:08.4721422Z fn() 2025-05-07T20:33:08.4721933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:08.4722513Z self.fn.run( 2025-05-07T20:33:08.4723025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4723562Z kernel = self.compile( 2025-05-07T20:33:08.4724115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4737303Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4737731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4737968Z 2025-05-07T20:33:08.4738182Z self = 2025-05-07T20:33:08.4739291Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4741196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1590a2f5e0>} 2025-05-07T20:33:08.4742560Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4743581Z context = 2025-05-07T20:33:08.4743885Z 2025-05-07T20:33:08.4744058Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4744600Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4745092Z module_map=module_map) 2025-05-07T20:33:08.4745468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4745839Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:08.4746126Z E ^ 2025-05-07T20:33:08.4746597Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4747060Z 2025-05-07T20:33:08.4747485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4748016Z 2025-05-07T20:33:08.4748124Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4748553Z self=, 2025-05-07T20:33:08.4748965Z T=128, 2025-05-07T20:33:08.4749171Z D=7168, 2025-05-07T20:33:08.4749376Z scale_ub=None, 2025-05-07T20:33:08.4749596Z contiguous=False, 2025-05-07T20:33:08.4749839Z compiled=False, 2025-05-07T20:33:08.4750064Z ) 2025-05-07T20:33:08.4750549Z self = 2025-05-07T20:33:08.4751059Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.4751338Z 2025-05-07T20:33:08.4751421Z @given( 2025-05-07T20:33:08.4751666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4751984Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4752305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4752712Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4753050Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4753350Z ) 2025-05-07T20:33:08.4753714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4754158Z def test_silu_mul_quant( 2025-05-07T20:33:08.4754413Z self, 2025-05-07T20:33:08.4754619Z T: int, 2025-05-07T20:33:08.4754818Z D: int, 2025-05-07T20:33:08.4755055Z scale_ub: Optional[float], 2025-05-07T20:33:08.4755340Z contiguous: bool, 2025-05-07T20:33:08.4755591Z compiled: bool, 2025-05-07T20:33:08.4755820Z ) -> None: 2025-05-07T20:33:08.4756044Z torch.manual_seed(2025) 2025-05-07T20:33:08.4756296Z 2025-05-07T20:33:08.4756642Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4757002Z 2025-05-07T20:33:08.4757203Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4757503Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4757829Z x = x_sign * x_clamp 2025-05-07T20:33:08.4758086Z x0 = x[:, :D] 2025-05-07T20:33:08.4758309Z x1 = x[:, D:] 2025-05-07T20:33:08.4758539Z 2025-05-07T20:33:08.4758777Z if contiguous: 2025-05-07T20:33:08.4759021Z x0 = x0.contiguous() 2025-05-07T20:33:08.4759296Z x1 = x1.contiguous() 2025-05-07T20:33:08.4759548Z 2025-05-07T20:33:08.4759816Z if scale_ub is not None: 2025-05-07T20:33:08.4760107Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4760454Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4760779Z ) 2025-05-07T20:33:08.4760982Z else: 2025-05-07T20:33:08.4761211Z scale_ub_tensor = None 2025-05-07T20:33:08.4761482Z 2025-05-07T20:33:08.4761719Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4762048Z op = silu_mul_quant 2025-05-07T20:33:08.4762314Z if compiled: 
2025-05-07T20:33:08.4762572Z op = torch.compile(op) 2025-05-07T20:33:08.4762881Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4763168Z 2025-05-07T20:33:08.4763367Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4763546Z 2025-05-07T20:33:08.4763652Z moe/activation_test.py:117: 2025-05-07T20:33:08.4763963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4764304Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4764604Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4765316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4766013Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4766558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4767260Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4767926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4768464Z kernel = self.compile( 2025-05-07T20:33:08.4769011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4769722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4770141Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4770375Z 2025-05-07T20:33:08.4770586Z self = 2025-05-07T20:33:08.4771669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4773104Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15906d2ee0>} 2025-05-07T20:33:08.4774452Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4775475Z context = 2025-05-07T20:33:08.4775767Z 2025-05-07T20:33:08.4775941Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4776471Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4776988Z module_map=module_map) 2025-05-07T20:33:08.4777363Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4777732Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4778001Z E ^ 2025-05-07T20:33:08.4778473Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4778929Z 2025-05-07T20:33:08.4779350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4779862Z 2025-05-07T20:33:08.4779969Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4780437Z self=, 2025-05-07T20:33:08.4780848Z T=4096, 2025-05-07T20:33:08.4781107Z D=5120, 2025-05-07T20:33:08.4781317Z scale_ub=1200.0, 2025-05-07T20:33:08.4781551Z contiguous=True, 2025-05-07T20:33:08.4781779Z compiled=False, 2025-05-07T20:33:08.4782001Z ) 2025-05-07T20:33:08.4782332Z self = 2025-05-07T20:33:08.4782834Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:08.4783124Z 2025-05-07T20:33:08.4783205Z @given( 2025-05-07T20:33:08.4783449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4783769Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4784087Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4784430Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4784773Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4785079Z ) 2025-05-07T20:33:08.4785436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4785874Z def test_silu_mul_quant( 2025-05-07T20:33:08.4786129Z self, 2025-05-07T20:33:08.4786331Z T: int, 2025-05-07T20:33:08.4786533Z D: int, 2025-05-07T20:33:08.4786764Z scale_ub: Optional[float], 2025-05-07T20:33:08.4787044Z contiguous: bool, 2025-05-07T20:33:08.4787286Z compiled: bool, 2025-05-07T20:33:08.4787522Z ) -> None: 2025-05-07T20:33:08.4787744Z torch.manual_seed(2025) 2025-05-07T20:33:08.4787990Z 2025-05-07T20:33:08.4788272Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4788619Z 2025-05-07T20:33:08.4788818Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4789116Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4789438Z x = x_sign * x_clamp 2025-05-07T20:33:08.4789745Z x0 = x[:, :D] 2025-05-07T20:33:08.4789967Z x1 = x[:, D:] 2025-05-07T20:33:08.4790182Z 2025-05-07T20:33:08.4790375Z if contiguous: 2025-05-07T20:33:08.4790608Z x0 = x0.contiguous() 2025-05-07T20:33:08.4790873Z x1 = x1.contiguous() 2025-05-07T20:33:08.4791120Z 2025-05-07T20:33:08.4791315Z if scale_ub is not None: 2025-05-07T20:33:08.4791594Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4791936Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4792291Z ) 2025-05-07T20:33:08.4792496Z else: 2025-05-07T20:33:08.4792712Z scale_ub_tensor = None 2025-05-07T20:33:08.4792965Z 2025-05-07T20:33:08.4793204Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4793527Z op = silu_mul_quant 2025-05-07T20:33:08.4793780Z if compiled: 2025-05-07T20:33:08.4794042Z op = torch.compile(op) 2025-05-07T20:33:08.4794356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4794638Z 2025-05-07T20:33:08.4794835Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4795009Z 2025-05-07T20:33:08.4795110Z moe/activation_test.py:117: 2025-05-07T20:33:08.4795412Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4795817Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4796111Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4796813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4797502Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4798048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4798732Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4799399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4799977Z kernel = self.compile( 2025-05-07T20:33:08.4800364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4800546Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4800685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4800692Z 2025-05-07T20:33:08.4800900Z self = 2025-05-07T20:33:08.4801673Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4802187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15907a9670>} 2025-05-07T20:33:08.4802933Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4803135Z context = 2025-05-07T20:33:08.4803140Z 2025-05-07T20:33:08.4803311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4803582Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4803701Z module_map=module_map) 2025-05-07T20:33:08.4803867Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4803974Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4804058Z E ^ 2025-05-07T20:33:08.4804454Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4804462Z 2025-05-07T20:33:08.4804888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4804893Z 2025-05-07T20:33:08.4804998Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4805232Z self=, 2025-05-07T20:33:08.4805313Z T=1, 2025-05-07T20:33:08.4805430Z D=5120, 2025-05-07T20:33:08.4805523Z scale_ub=None, 2025-05-07T20:33:08.4805611Z contiguous=True, 2025-05-07T20:33:08.4805696Z compiled=True, 2025-05-07T20:33:08.4805779Z ) 2025-05-07T20:33:08.4805999Z self = 2025-05-07T20:33:08.4806165Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:08.4806170Z 2025-05-07T20:33:08.4806257Z @given( 2025-05-07T20:33:08.4806382Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4806487Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4806616Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4806738Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4806861Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4806978Z ) 2025-05-07T20:33:08.4807227Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4807335Z def test_silu_mul_quant( 2025-05-07T20:33:08.4807417Z self, 2025-05-07T20:33:08.4807496Z T: int, 2025-05-07T20:33:08.4807580Z D: int, 2025-05-07T20:33:08.4807680Z scale_ub: Optional[float], 2025-05-07T20:33:08.4807771Z contiguous: bool, 2025-05-07T20:33:08.4807868Z compiled: bool, 2025-05-07T20:33:08.4807949Z ) -> None: 2025-05-07T20:33:08.4808053Z torch.manual_seed(2025) 2025-05-07T20:33:08.4808129Z 2025-05-07T20:33:08.4808306Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4808427Z 2025-05-07T20:33:08.4808521Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4808653Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4808750Z x = x_sign * x_clamp 2025-05-07T20:33:08.4808833Z x0 = x[:, :D] 2025-05-07T20:33:08.4808917Z x1 = x[:, D:] 2025-05-07T20:33:08.4808999Z 2025-05-07T20:33:08.4809086Z if contiguous: 2025-05-07T20:33:08.4809182Z x0 = x0.contiguous() 2025-05-07T20:33:08.4809282Z x1 = x1.contiguous() 2025-05-07T20:33:08.4809356Z 2025-05-07T20:33:08.4809452Z if scale_ub is not None: 2025-05-07T20:33:08.4809565Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4809715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4809794Z ) 2025-05-07T20:33:08.4809882Z else: 2025-05-07T20:33:08.4809978Z scale_ub_tensor = None 2025-05-07T20:33:08.4810061Z 2025-05-07T20:33:08.4810202Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4810296Z op = silu_mul_quant 2025-05-07T20:33:08.4810387Z if compiled: 2025-05-07T20:33:08.4810503Z op = torch.compile(op) 2025-05-07T20:33:08.4810612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4810695Z 2025-05-07T20:33:08.4810790Z y_fp8, y_scale = fn() 2025-05-07T20:33:08.4810916Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:08.4810998Z 2025-05-07T20:33:08.4811139Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4811245Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:08.4811354Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:08.4811478Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:08.4811619Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4811706Z 2025-05-07T20:33:08.4811867Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:08.4811872Z 2025-05-07T20:33:08.4811985Z moe/activation_test.py:126: 2025-05-07T20:33:08.4812116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4812225Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:08.4812375Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4812939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:08.4813087Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.4813453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4813679Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4814053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:08.4814318Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4814711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:08.4815011Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4815392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:08.4815570Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.4815912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:08.4815992Z fn() 2025-05-07T20:33:08.4816391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:08.4816521Z self.fn.run( 2025-05-07T20:33:08.4816855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4816957Z kernel = self.compile( 2025-05-07T20:33:08.4817333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4817520Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4817650Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4817657Z 2025-05-07T20:33:08.4817864Z self = 2025-05-07T20:33:08.4818643Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4819147Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f15903f6550>}
2025-05-07T20:33:08.4819898Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:08.4820092Z context = 
2025-05-07T20:33:08.4820276Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:08.4820541Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:08.4820650Z                            module_map=module_map)
2025-05-07T20:33:08.4820822Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.4820926Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:08.4821074Z E       ^
2025-05-07T20:33:08.4821486Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.4821909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:08.4822027Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.4822251Z     self=,
2025-05-07T20:33:08.4822368Z     T=2048,
2025-05-07T20:33:08.4822458Z     D=5120,
2025-05-07T20:33:08.4822542Z     scale_ub=None,
2025-05-07T20:33:08.4822630Z     contiguous=True,
2025-05-07T20:33:08.4822723Z     compiled=True,
2025-05-07T20:33:08.4822798Z )
2025-05-07T20:33:08.4823019Z self = 
2025-05-07T20:33:08.4823198Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:33:08.4823289Z     @given(
2025-05-07T20:33:08.4823417Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:08.4823520Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:08.4823638Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:08.4823804Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:08.4823922Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:08.4823998Z     )
2025-05-07T20:33:08.4824258Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:08.4824355Z     def test_silu_mul_quant(
2025-05-07T20:33:08.4824435Z         self,
2025-05-07T20:33:08.4824521Z         T: int,
2025-05-07T20:33:08.4824601Z         D: int,
2025-05-07T20:33:08.4824709Z         scale_ub: Optional[float],
2025-05-07T20:33:08.4824801Z         contiguous: bool,
2025-05-07T20:33:08.4824891Z         compiled: bool,
2025-05-07T20:33:08.4824976Z     ) -> None:
2025-05-07T20:33:08.4825117Z         torch.manual_seed(2025)
2025-05-07T20:33:08.4825193Z 
2025-05-07T20:33:08.4825373Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:08.4825451Z 
2025-05-07T20:33:08.4825546Z         x_sign = torch.sign(x)
2025-05-07T20:33:08.4825681Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:08.4825776Z         x = x_sign * x_clamp
2025-05-07T20:33:08.4825860Z         x0 = x[:, :D]
2025-05-07T20:33:08.4825950Z         x1 = x[:, D:]
2025-05-07T20:33:08.4826031Z 
2025-05-07T20:33:08.4826122Z         if contiguous:
2025-05-07T20:33:08.4826226Z             x0 = x0.contiguous()
2025-05-07T20:33:08.4826315Z             x1 = x1.contiguous()
2025-05-07T20:33:08.4826401Z 
2025-05-07T20:33:08.4826495Z         if scale_ub is not None:
2025-05-07T20:33:08.4826602Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:08.4826750Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:08.4826829Z             )
2025-05-07T20:33:08.4826915Z         else:
2025-05-07T20:33:08.4827022Z             scale_ub_tensor = None
2025-05-07T20:33:08.4827099Z 
2025-05-07T20:33:08.4827234Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:08.4827340Z             op = silu_mul_quant
2025-05-07T20:33:08.4827430Z             if compiled:
2025-05-07T20:33:08.4827535Z                 op = torch.compile(op)
2025-05-07T20:33:08.4827653Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:08.4827732Z 
2025-05-07T20:33:08.4827835Z         y_fp8, y_scale = fn()
2025-05-07T20:33:08.4827959Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:08.4828034Z 
2025-05-07T20:33:08.4828180Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:08.4828287Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:08.4828390Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:08.4828525Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:08.4828720Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:08.4828797Z 
2025-05-07T20:33:08.4828910Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:08.4829014Z moe/activation_test.py:126: 
2025-05-07T20:33:08.4829157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:08.4829265Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:08.4829402Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:08.4830034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:08.4830139Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:08.4830497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:33:08.4830735Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:08.4831112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:08.4831376Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:08.4831807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 
2025-05-07T20:33:08.4832067Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:08.4832454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:08.4832625Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:08.4832973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:08.4833054Z     fn()
2025-05-07T20:33:08.4833453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:08.4833586Z     self.fn.run(
2025-05-07T20:33:08.4833925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:08.4834022Z     kernel = self.compile(
2025-05-07T20:33:08.4834410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:08.4834596Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:08.4834735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:08.4834947Z self = 
2025-05-07T20:33:08.4835722Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:08.4836243Z codegen_fns = {'convert_custom_types': , 'min_dot_size': }
2025-05-07T20:33:08.4836985Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:08.4837186Z context = 
2025-05-07T20:33:08.4837359Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:08.4837630Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:08.4837742Z                            module_map=module_map)
2025-05-07T20:33:08.4837907Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.4838062Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:08.4838145Z E       ^
2025-05-07T20:33:08.4838500Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.4838931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
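Every Hypothesis example in this run fails with the same root cause: Triton maps torch.float8_e4m3fn to its fp8e4nv type, and Triton's NVIDIA backend only accepts fp8e4nv on GPUs with compute capability 8.9 or newer (Ada/Hopper), while the linux.g5.4xlarge runner's A10G reports capability 8.6. A minimal sketch of that diagnosis in plain PyTorch; the helper name supports_fp8e4nv is ours, not FBGEMM's:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) Triton kernels need compute capability >= (8, 9);
        # the A10G on this runner reports (8, 6), so every compile fails.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)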
2025-05-07T20:33:08.4839078Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:08.4855418Z E       triton.compiler.errors.CompilationError in _kernel_quantize_fp8_row (failed in ref_fn via triton_quantize_fp8_row): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); test source and traceback identical to the first example above.
2025-05-07T20:33:08.4856536Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:08.4872476Z E       triton.compiler.errors.CompilationError in _kernel_quantize_fp8_row: same ValueError, same traceback.
2025-05-07T20:33:08.4873589Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:08.4895367Z E       triton.compiler.errors.CompilationError in _kernel_quantize_fp8_row: same ValueError, same traceback.
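For context on what the failing reference path computes: triton_quantize_fp8_row performs row-wise max-abs quantization into torch.float8_e4m3fn (the dtype Triton calls fp8e4nv). A rough eager-mode equivalent, assuming per-row scale row_max / fp8_max with an optional cap on the row max; details such as epsilon handling may differ from the actual kernel, and quantize_fp8_row_eager is our name:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_eager(
        y: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        eps: float = 1e-12,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        scale = row_max.clamp_min(eps) / fp8_max        # per-row dequant multiplier
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

This matches how the test dequantizes, y_fp8.to(torch.float32) * y_scale[:, None], since y ~ (y / scale) * scale.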
2025-05-07T20:33:08.4896507Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.4896813Z     T=1,
2025-05-07T20:33:08.4896902Z     D=5120,
2025-05-07T20:33:08.4896994Z     scale_ub=1200.0,
2025-05-07T20:33:08.4897084Z     contiguous=True,
2025-05-07T20:33:08.4897177Z     compiled=True,
2025-05-07T20:33:08.4897257Z )
2025-05-07T20:33:08.4897692Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True (test source identical to the first example above)
2025-05-07T20:33:08.4902486Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:08.4902642Z moe/activation_test.py:117: 
2025-05-07T20:33:08.4902776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:08.4902891Z moe/activation_test.py:115: in fn
2025-05-07T20:33:08.4902995Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:08.4903362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:08.4903465Z     return fn(*args, **kwargs)
2025-05-07T20:33:08.4904005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:08.4904114Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:08.4904476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:33:08.4904708Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:08.4905058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:08.4905157Z     kernel = self.compile(
2025-05-07T20:33:08.4905541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:08.4905763Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:08.4905893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:08.4906120Z self = 
2025-05-07T20:33:08.4906907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:08.4907425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': }
2025-05-07T20:33:08.4908222Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:08.4908419Z context = 
2025-05-07T20:33:08.4908599Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:08.4908871Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:08.4908993Z                            module_map=module_map)
2025-05-07T20:33:08.4909161Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.4909265Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:08.4909354Z E       ^
2025-05-07T20:33:08.4909717Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.4910136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
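This example is the first to fail inside fn() itself: the fused _fbgemm_silu_mul_quant kernel launched from fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 hits the identical compile error, so both the op under test and the reference path depend on fp8e4nv. What the fused op computes, sketched in eager PyTorch on top of the quantize_fp8_row_eager helper above (our naming, not the library's):

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_eager(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then row-wise fp8 quantization,
        # mirroring ref_fn in the test source above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        return quantize_fp8_row_eager(y, scale_ub)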
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4909725Z 2025-05-07T20:33:08.4910136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4910152Z 2025-05-07T20:33:08.4910262Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4910490Z self=, 2025-05-07T20:33:08.4910578Z T=1, 2025-05-07T20:33:08.4910658Z D=5120, 2025-05-07T20:33:08.4910742Z scale_ub=None, 2025-05-07T20:33:08.4910840Z contiguous=False, 2025-05-07T20:33:08.4910925Z compiled=True, 2025-05-07T20:33:08.4911001Z ) 2025-05-07T20:33:08.4911230Z self = 2025-05-07T20:33:08.4911399Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.4911404Z 2025-05-07T20:33:08.4911485Z @given( 2025-05-07T20:33:08.4911657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4911759Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4911883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4912003Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4912123Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4912207Z ) 2025-05-07T20:33:08.4912454Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4912611Z def test_silu_mul_quant( 2025-05-07T20:33:08.4912698Z self, 2025-05-07T20:33:08.4912779Z T: int, 2025-05-07T20:33:08.4912860Z D: int, 2025-05-07T20:33:08.4912973Z scale_ub: Optional[float], 2025-05-07T20:33:08.4913068Z contiguous: bool, 2025-05-07T20:33:08.4913166Z compiled: bool, 2025-05-07T20:33:08.4913249Z ) -> None: 2025-05-07T20:33:08.4913349Z torch.manual_seed(2025) 2025-05-07T20:33:08.4913434Z 2025-05-07T20:33:08.4913615Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4913693Z 2025-05-07T20:33:08.4913796Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4913927Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4914019Z x = x_sign * x_clamp 2025-05-07T20:33:08.4914156Z x0 = x[:, :D] 2025-05-07T20:33:08.4914243Z x1 = x[:, D:] 2025-05-07T20:33:08.4914320Z 2025-05-07T20:33:08.4914419Z if contiguous: 2025-05-07T20:33:08.4914515Z x0 = x0.contiguous() 2025-05-07T20:33:08.4914610Z x1 = x1.contiguous() 2025-05-07T20:33:08.4914695Z 2025-05-07T20:33:08.4914792Z if scale_ub is not None: 2025-05-07T20:33:08.4914910Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4915052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4915132Z ) 2025-05-07T20:33:08.4915226Z else: 2025-05-07T20:33:08.4915375Z scale_ub_tensor = None 2025-05-07T20:33:08.4915451Z 2025-05-07T20:33:08.4915594Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4915689Z op = silu_mul_quant 2025-05-07T20:33:08.4915779Z if compiled: 2025-05-07T20:33:08.4915897Z op = torch.compile(op) 2025-05-07T20:33:08.4916010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4916087Z 2025-05-07T20:33:08.4916193Z y_fp8, y_scale = fn() 2025-05-07T20:33:08.4916322Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:08.4916408Z 2025-05-07T20:33:08.4916547Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4916652Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:08.4916754Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:08.4916890Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:08.4917032Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4917114Z 2025-05-07T20:33:08.4917226Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:08.4917231Z 2025-05-07T20:33:08.4917331Z moe/activation_test.py:126: 2025-05-07T20:33:08.4917473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4917586Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:08.4917728Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.4918297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:08.4918403Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.4918768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4919003Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4919412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:08.4919692Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4920093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:08.4920351Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.4920770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:08.4920939Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.4921285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:08.4921366Z fn() 2025-05-07T20:33:08.4921760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:08.4921861Z self.fn.run( 2025-05-07T20:33:08.4922194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4922290Z kernel = self.compile( 2025-05-07T20:33:08.4922714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4922899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4923039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4923043Z 2025-05-07T20:33:08.4923253Z self = 2025-05-07T20:33:08.4924039Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4924600Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f158f6ace50>} 2025-05-07T20:33:08.4925359Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4925561Z context = 2025-05-07T20:33:08.4925569Z 2025-05-07T20:33:08.4925739Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4926011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4926124Z module_map=module_map) 2025-05-07T20:33:08.4926287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4926399Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:08.4926484Z E ^ 2025-05-07T20:33:08.4926843Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4926848Z 2025-05-07T20:33:08.4927274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4927278Z 2025-05-07T20:33:08.4927384Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4927613Z self=, 2025-05-07T20:33:08.4927694Z T=1, 2025-05-07T20:33:08.4927773Z D=5120, 2025-05-07T20:33:08.4927863Z scale_ub=None, 2025-05-07T20:33:08.4927950Z contiguous=True, 2025-05-07T20:33:08.4928036Z compiled=False, 2025-05-07T20:33:08.4928116Z ) 2025-05-07T20:33:08.4928334Z self = 2025-05-07T20:33:08.4928500Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:08.4928546Z 2025-05-07T20:33:08.4928636Z @given( 2025-05-07T20:33:08.4928757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4928866Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4928983Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4929106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4929228Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4929343Z ) 2025-05-07T20:33:08.4929593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4929695Z def test_silu_mul_quant( 2025-05-07T20:33:08.4929774Z self, 2025-05-07T20:33:08.4929853Z T: int, 2025-05-07T20:33:08.4929939Z D: int, 2025-05-07T20:33:08.4930039Z scale_ub: Optional[float], 2025-05-07T20:33:08.4930131Z contiguous: bool, 2025-05-07T20:33:08.4930227Z compiled: bool, 2025-05-07T20:33:08.4930308Z ) -> None: 2025-05-07T20:33:08.4930418Z torch.manual_seed(2025) 2025-05-07T20:33:08.4930495Z 2025-05-07T20:33:08.4930669Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4930753Z 2025-05-07T20:33:08.4930847Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4931037Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4931136Z x = x_sign * x_clamp 2025-05-07T20:33:08.4931218Z x0 = x[:, :D] 2025-05-07T20:33:08.4931303Z x1 = x[:, D:] 2025-05-07T20:33:08.4931384Z 2025-05-07T20:33:08.4931470Z if contiguous: 2025-05-07T20:33:08.4931565Z x0 = x0.contiguous() 2025-05-07T20:33:08.4931663Z x1 = x1.contiguous() 2025-05-07T20:33:08.4931737Z 2025-05-07T20:33:08.4931836Z if scale_ub is not None: 2025-05-07T20:33:08.4931943Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4932083Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4932240Z ) 2025-05-07T20:33:08.4932321Z else: 2025-05-07T20:33:08.4932417Z scale_ub_tensor = None 2025-05-07T20:33:08.4932499Z 2025-05-07T20:33:08.4932633Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4932726Z op = silu_mul_quant 2025-05-07T20:33:08.4932822Z if compiled: 2025-05-07T20:33:08.4932926Z op 
= torch.compile(op) 2025-05-07T20:33:08.4933034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4933119Z 2025-05-07T20:33:08.4933212Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4933217Z 2025-05-07T20:33:08.4933326Z moe/activation_test.py:117: 2025-05-07T20:33:08.4933457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4933561Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4933669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4934169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4934272Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4934635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4934863Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4935215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4935313Z kernel = self.compile( 2025-05-07T20:33:08.4935692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4935879Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4936008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4936013Z 2025-05-07T20:33:08.4936268Z self = 2025-05-07T20:33:08.4937047Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4937555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158fdb0790>} 2025-05-07T20:33:08.4938355Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4938550Z context = 2025-05-07T20:33:08.4938555Z 2025-05-07T20:33:08.4938729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4939009Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4939121Z module_map=module_map) 2025-05-07T20:33:08.4939293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4939393Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4939510Z E ^ 2025-05-07T20:33:08.4939877Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4939884Z 2025-05-07T20:33:08.4940642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4940652Z 2025-05-07T20:33:08.4940798Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4941074Z self=, 2025-05-07T20:33:08.4941155Z T=128, 2025-05-07T20:33:08.4941242Z D=5120, 2025-05-07T20:33:08.4941327Z scale_ub=None, 2025-05-07T20:33:08.4941621Z contiguous=False, 2025-05-07T20:33:08.4941708Z compiled=True, 2025-05-07T20:33:08.4941785Z ) 2025-05-07T20:33:08.4942017Z self = 2025-05-07T20:33:08.4942194Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.4942203Z 2025-05-07T20:33:08.4942284Z @given( 2025-05-07T20:33:08.4942412Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4942516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4942633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4942758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4942874Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4942957Z ) 2025-05-07T20:33:08.4943207Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4943303Z def test_silu_mul_quant( 2025-05-07T20:33:08.4943395Z self, 2025-05-07T20:33:08.4943474Z T: int, 2025-05-07T20:33:08.4943553Z D: int, 2025-05-07T20:33:08.4943661Z scale_ub: Optional[float], 2025-05-07T20:33:08.4943753Z contiguous: bool, 2025-05-07T20:33:08.4943842Z compiled: bool, 2025-05-07T20:33:08.4943933Z ) -> None: 2025-05-07T20:33:08.4944032Z torch.manual_seed(2025) 2025-05-07T20:33:08.4944108Z 2025-05-07T20:33:08.4944287Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4944370Z 2025-05-07T20:33:08.4944466Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4944608Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4944704Z x = x_sign * x_clamp 2025-05-07T20:33:08.4944788Z x0 = x[:, :D] 2025-05-07T20:33:08.4944877Z x1 = x[:, D:] 2025-05-07T20:33:08.4944952Z 2025-05-07T20:33:08.4945041Z if contiguous: 2025-05-07T20:33:08.4945143Z x0 = x0.contiguous() 2025-05-07T20:33:08.4945306Z x1 = x1.contiguous() 2025-05-07T20:33:08.4945393Z 2025-05-07T20:33:08.4945489Z if scale_ub is not None: 2025-05-07T20:33:08.4945598Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4945747Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4945827Z ) 2025-05-07T20:33:08.4945911Z else: 2025-05-07T20:33:08.4946018Z scale_ub_tensor = None 2025-05-07T20:33:08.4946151Z 2025-05-07T20:33:08.4946289Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4946390Z op = silu_mul_quant 2025-05-07T20:33:08.4946479Z if compiled: 2025-05-07T20:33:08.4946582Z op = torch.compile(op) 2025-05-07T20:33:08.4946703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4946780Z 2025-05-07T20:33:08.4946887Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4946891Z 2025-05-07T20:33:08.4946996Z moe/activation_test.py:117: 2025-05-07T20:33:08.4947140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4947251Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4947355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4947789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.4947894Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.4948389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4948499Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4948856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4949083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4949430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4949584Z kernel = self.compile( 2025-05-07T20:33:08.4949962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4950148Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4950282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4950286Z 2025-05-07T20:33:08.4950506Z self = 2025-05-07T20:33:08.4951271Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4951776Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ee51040>} 2025-05-07T20:33:08.4952537Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4952734Z context = 2025-05-07T20:33:08.4952739Z 2025-05-07T20:33:08.4952916Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4953187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4953302Z module_map=module_map) 2025-05-07T20:33:08.4953480Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4953582Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4953673Z E ^ 2025-05-07T20:33:08.4954069Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4954077Z 2025-05-07T20:33:08.4954495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4954500Z 2025-05-07T20:33:08.4954614Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4954841Z self=, 2025-05-07T20:33:08.4954926Z T=128, 2025-05-07T20:33:08.4955006Z D=7168, 2025-05-07T20:33:08.4955131Z scale_ub=1200.0, 2025-05-07T20:33:08.4955233Z contiguous=False, 2025-05-07T20:33:08.4955322Z compiled=False, 2025-05-07T20:33:08.4955398Z ) 2025-05-07T20:33:08.4955624Z self = 2025-05-07T20:33:08.4955800Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.4955804Z 2025-05-07T20:33:08.4955885Z @given( 2025-05-07T20:33:08.4956013Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4956123Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4956250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4956372Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4956488Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4956570Z ) 2025-05-07T20:33:08.4956858Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4956960Z def test_silu_mul_quant( 2025-05-07T20:33:08.4957046Z self, 2025-05-07T20:33:08.4957129Z T: int, 2025-05-07T20:33:08.4957211Z D: int, 2025-05-07T20:33:08.4957317Z scale_ub: Optional[float], 2025-05-07T20:33:08.4957410Z contiguous: bool, 2025-05-07T20:33:08.4957499Z compiled: bool, 2025-05-07T20:33:08.4957586Z ) -> None: 2025-05-07T20:33:08.4957688Z torch.manual_seed(2025) 2025-05-07T20:33:08.4957772Z 2025-05-07T20:33:08.4957952Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4958071Z 2025-05-07T20:33:08.4958173Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4958302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4958395Z x = x_sign * x_clamp 2025-05-07T20:33:08.4958486Z x0 = x[:, :D] 2025-05-07T20:33:08.4958573Z x1 = x[:, D:] 2025-05-07T20:33:08.4958649Z 2025-05-07T20:33:08.4958746Z if contiguous: 2025-05-07T20:33:08.4958844Z x0 = x0.contiguous() 2025-05-07T20:33:08.4958939Z x1 = x1.contiguous() 2025-05-07T20:33:08.4959025Z 2025-05-07T20:33:08.4959117Z if scale_ub is not None: 2025-05-07T20:33:08.4959225Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4959375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4959456Z ) 2025-05-07T20:33:08.4959545Z else: 2025-05-07T20:33:08.4959642Z scale_ub_tensor = None 2025-05-07T20:33:08.4959724Z 2025-05-07T20:33:08.4959866Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4959963Z op = silu_mul_quant 2025-05-07T20:33:08.4960052Z if compiled: 2025-05-07T20:33:08.4960167Z op = torch.compile(op) 2025-05-07T20:33:08.4960279Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4960356Z 2025-05-07T20:33:08.4960461Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4960465Z 2025-05-07T20:33:08.4960572Z moe/activation_test.py:117: 2025-05-07T20:33:08.4960713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4960817Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4960921Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4961423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4961523Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4962778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4963028Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4963382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4963485Z kernel = self.compile( 2025-05-07T20:33:08.4963864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4964115Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4964251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4964255Z 2025-05-07T20:33:08.4964465Z self = 2025-05-07T20:33:08.4965260Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4965805Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ee51ca0>} 2025-05-07T20:33:08.4966559Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4966769Z context = 2025-05-07T20:33:08.4966774Z 2025-05-07T20:33:08.4966941Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4967213Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4967329Z module_map=module_map) 2025-05-07T20:33:08.4967534Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4967641Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4967720Z E ^ 2025-05-07T20:33:08.4968081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4968094Z 2025-05-07T20:33:08.4968511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4968518Z 2025-05-07T20:33:08.4968622Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4968850Z self=, 2025-05-07T20:33:08.4968932Z T=128, 2025-05-07T20:33:08.4969011Z D=5120, 2025-05-07T20:33:08.4969105Z scale_ub=None, 2025-05-07T20:33:08.4969194Z contiguous=False, 2025-05-07T20:33:08.4969281Z compiled=False, 2025-05-07T20:33:08.4969362Z ) 2025-05-07T20:33:08.4969584Z self = 2025-05-07T20:33:08.4969766Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.4969770Z 2025-05-07T20:33:08.4969850Z @given( 2025-05-07T20:33:08.4969971Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4970080Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4970198Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4970321Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4970445Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4970522Z ) 2025-05-07T20:33:08.4970778Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4970873Z def test_silu_mul_quant( 2025-05-07T20:33:08.4970952Z self, 2025-05-07T20:33:08.4971036Z T: int, 2025-05-07T20:33:08.4971119Z D: int, 2025-05-07T20:33:08.4971268Z scale_ub: Optional[float], 2025-05-07T20:33:08.4971370Z contiguous: bool, 2025-05-07T20:33:08.4971459Z compiled: bool, 2025-05-07T20:33:08.4971540Z ) -> None: 2025-05-07T20:33:08.4971646Z torch.manual_seed(2025) 2025-05-07T20:33:08.4971721Z 2025-05-07T20:33:08.4971895Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4971980Z 2025-05-07T20:33:08.4972074Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4972241Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4972340Z x = x_sign * x_clamp 2025-05-07T20:33:08.4972424Z x0 = x[:, :D] 2025-05-07T20:33:08.4972511Z x1 = x[:, D:] 2025-05-07T20:33:08.4972585Z 2025-05-07T20:33:08.4972671Z if contiguous: 2025-05-07T20:33:08.4972771Z x0 = x0.contiguous() 2025-05-07T20:33:08.4972863Z x1 = x1.contiguous() 2025-05-07T20:33:08.4972938Z 2025-05-07T20:33:08.4973036Z if scale_ub is not None: 2025-05-07T20:33:08.4973151Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4973288Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4973372Z ) 2025-05-07T20:33:08.4973452Z else: 2025-05-07T20:33:08.4973551Z scale_ub_tensor = None 2025-05-07T20:33:08.4973631Z 2025-05-07T20:33:08.4973804Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4973905Z op = silu_mul_quant 2025-05-07T20:33:08.4973997Z if compiled: 2025-05-07T20:33:08.4974100Z op = torch.compile(op) 2025-05-07T20:33:08.4974212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4974287Z 2025-05-07T20:33:08.4974380Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4974385Z 2025-05-07T20:33:08.4974491Z moe/activation_test.py:117: 2025-05-07T20:33:08.4974623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4974770Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4974880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4975378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4975485Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4975846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4976079Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4976428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4976524Z kernel = self.compile( 2025-05-07T20:33:08.4976903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4977086Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4977221Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4977226Z 2025-05-07T20:33:08.4977439Z self = 2025-05-07T20:33:08.4978210Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4978765Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158f3fd310>} 2025-05-07T20:33:08.4979530Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4979767Z context = 2025-05-07T20:33:08.4979775Z 2025-05-07T20:33:08.4979954Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4980223Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4980344Z module_map=module_map) 2025-05-07T20:33:08.4980510Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4980613Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4980740Z E ^ 2025-05-07T20:33:08.4981206Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.4981621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

The next three Hypothesis examples print the same test source and fail with the identical traceback and CompilationError as above (silu_mul_quant -> _fbgemm_silu_mul_quant[grid] -> triton compile -> make_ir); only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError, as above
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError, as above (entered through torch/_dynamo/eval_frame.py:678 since compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError, as above
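Why every variant fails the same way: Triton's fp8e4nv is the NVIDIA e4m3 float8 format (torch.float8_e4m3fn on the PyTorch side), and the error lists only ('fp8e4b15', 'fp8e5') as available, which is what Triton reports on GPUs older than the Ada/Hopper generation. The following minimal sketch reproduces the same CompilationError on such a part; it is an illustration under that assumption, not FBGEMM's actual kernel, and the kernel name here is hypothetical:

    # Hypothetical repro sketch: cast to fp8e4nv inside a Triton kernel on a
    # pre-Ada CUDA GPU and the compile fails exactly as in the log above.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        x = tl.load(x_ptr + offs, mask=offs < n)
        # This .to() is what make_ir / ast_to_ttir rejects on older parts:
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=offs < n)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(8,)](x, y, 1024, BLOCK=128)  # CompilationError on older GPUs

The cast is rejected while lowering the AST to TTIR, which is why the error points at the kernel's def line ("at 1:0: def _fbgemm_silu_mul_quant( ^") rather than at the offending cast itself.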
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
Same test source as above, but this example fails later, in the reference path rather than in silu_mul_quant: fn() returns, and the exception is raised inside ref_fn, which recomputes y = x0 * sigmoid(x0) * x1 in fp32 and quantizes it row-wise:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
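The reference path dies the same way because triton_quantize_fp8_row also JIT-compiles a Triton kernel (_kernel_quantize_fp8_row) that casts to fp8e4nv; the autotuner frames appear because each candidate config is compiled and benchmarked before timing. For orientation, here is a plain-PyTorch sketch of the per-row fp8 quantization the reference asks for, with the optional scale upper bound; the semantics are assumed, inferred only from the test's dequantization y = y_fp8.float() * y_scale[:, None], not taken from FBGEMM's source:

    # Hedged sketch of row-wise FP8 quantization (assumed semantics).
    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max in fp32, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale).to(torch.float8_e4m3fn)
        # y is recovered (approximately) as y_fp8.float() * scale[:, None].
        return y_fp8, scale.squeeze(-1)

On a GPU where the fp8e4nv cast compiles, each row of y_fp8 then spans the representable e4m3 range, and y_scale is the per-row dequantization factor the test multiplies back in.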
The remaining Hypothesis examples again print the same test source and fail at fn() in silu_mul_quant with the identical traceback and CompilationError as the first example:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError, same traceback
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError, same traceback
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError, same traceback
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError, same traceback
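Every parameter combination fails for the same environmental reason rather than a logic bug: the runner's GPU predates fp8e4nv support, so Triton only offers fp8e4b15 and fp8e5 there. A common way to keep such a suite green on older parts is to gate FP8 tests on device capability. A hedged sketch follows; the (8, 9) threshold for fp8e4nv availability is an assumption to verify against Triton's documentation, and the class name is hypothetical, not FBGEMM's:

    import unittest
    import torch

    def _gpu_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv lowering needs compute capability
        # >= (8, 9); older GPUs raise the supported-dtypes error seen above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_gpu_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class Fp8ActivationTest(unittest.TestCase):  # hypothetical name
        ...

Gating at collection time also avoids paying the autotuner's compile-and-bench loop for every Hypothesis example before each one fails.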
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError, same traceback
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError, same traceback

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
Same test source as above; the call fails at fn() again:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
2025-05-07T20:33:08.5133481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5133580Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5133938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5134176Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5134509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5134614Z kernel = self.compile( 2025-05-07T20:33:08.5134991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5135174Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5135306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5135310Z 2025-05-07T20:33:08.5135516Z self = 2025-05-07T20:33:08.5136338Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5136862Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158eb0dee0>} 2025-05-07T20:33:08.5137616Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5137854Z context = 2025-05-07T20:33:08.5137859Z 2025-05-07T20:33:08.5138025Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5138295Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5138403Z module_map=module_map) 2025-05-07T20:33:08.5144076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5144213Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5144311Z E ^ 2025-05-07T20:33:08.5144680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5144685Z 2025-05-07T20:33:08.5145313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5145318Z 2025-05-07T20:33:08.5145441Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5145668Z self=, 2025-05-07T20:33:08.5145750Z T=4096, 2025-05-07T20:33:08.5145843Z D=5120, 2025-05-07T20:33:08.5145930Z scale_ub=None, 2025-05-07T20:33:08.5146027Z contiguous=False, 2025-05-07T20:33:08.5146114Z compiled=True, 2025-05-07T20:33:08.5146192Z ) 2025-05-07T20:33:08.5146420Z self = 2025-05-07T20:33:08.5146681Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.5146686Z 2025-05-07T20:33:08.5146768Z @given( 2025-05-07T20:33:08.5146901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5147006Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5147129Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5147258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5147381Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5147469Z ) 2025-05-07T20:33:08.5147719Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5147818Z def test_silu_mul_quant( 2025-05-07T20:33:08.5147908Z self, 2025-05-07T20:33:08.5147992Z T: int, 2025-05-07T20:33:08.5148074Z D: int, 2025-05-07T20:33:08.5148182Z scale_ub: Optional[float], 2025-05-07T20:33:08.5148275Z contiguous: bool, 2025-05-07T20:33:08.5148369Z compiled: bool, 2025-05-07T20:33:08.5148460Z ) -> None: 2025-05-07T20:33:08.5148559Z torch.manual_seed(2025) 2025-05-07T20:33:08.5148640Z 2025-05-07T20:33:08.5148821Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5148900Z 2025-05-07T20:33:08.5149010Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5149139Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5149236Z x = x_sign * x_clamp 2025-05-07T20:33:08.5149329Z x0 = x[:, :D] 2025-05-07T20:33:08.5149415Z x1 = x[:, D:] 2025-05-07T20:33:08.5149491Z 2025-05-07T20:33:08.5149587Z if contiguous: 2025-05-07T20:33:08.5149682Z x0 = x0.contiguous() 2025-05-07T20:33:08.5149774Z x1 = x1.contiguous() 2025-05-07T20:33:08.5149862Z 2025-05-07T20:33:08.5149957Z if scale_ub is not None: 2025-05-07T20:33:08.5150066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5150289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5150375Z ) 2025-05-07T20:33:08.5150468Z else: 2025-05-07T20:33:08.5150565Z scale_ub_tensor = None 2025-05-07T20:33:08.5150643Z 2025-05-07T20:33:08.5150785Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5150881Z op = silu_mul_quant 2025-05-07T20:33:08.5150969Z if compiled: 2025-05-07T20:33:08.5151079Z op = torch.compile(op) 2025-05-07T20:33:08.5151251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5151329Z 2025-05-07T20:33:08.5151431Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5151435Z 2025-05-07T20:33:08.5151536Z moe/activation_test.py:117: 2025-05-07T20:33:08.5151668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5151783Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5151885Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5152269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5152365Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5152900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5153014Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5153370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5153605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5153948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5154049Z kernel = self.compile( 2025-05-07T20:33:08.5154432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5154685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5154815Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5154819Z 2025-05-07T20:33:08.5155032Z self = 2025-05-07T20:33:08.5155808Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5156325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158f2b9940>} 2025-05-07T20:33:08.5157079Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5157283Z context = 2025-05-07T20:33:08.5157288Z 2025-05-07T20:33:08.5157460Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5157724Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5157841Z module_map=module_map) 2025-05-07T20:33:08.5158008Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5158108Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5158195Z E ^ 2025-05-07T20:33:08.5158552Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5158557Z 2025-05-07T20:33:08.5158981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5158988Z 2025-05-07T20:33:08.5159135Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5159360Z self=, 2025-05-07T20:33:08.5159448Z T=4096, 2025-05-07T20:33:08.5159528Z D=5120, 2025-05-07T20:33:08.5159612Z scale_ub=1200.0, 2025-05-07T20:33:08.5159708Z contiguous=False, 2025-05-07T20:33:08.5159799Z compiled=False, 2025-05-07T20:33:08.5159883Z ) 2025-05-07T20:33:08.5160102Z self = 2025-05-07T20:33:08.5160320Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.5160324Z 2025-05-07T20:33:08.5160416Z @given( 2025-05-07T20:33:08.5160538Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5160641Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5160768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5160888Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5161015Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5161101Z ) 2025-05-07T20:33:08.5161348Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5161452Z def test_silu_mul_quant( 2025-05-07T20:33:08.5161533Z self, 2025-05-07T20:33:08.5161652Z T: int, 2025-05-07T20:33:08.5161742Z D: int, 2025-05-07T20:33:08.5161845Z scale_ub: Optional[float], 2025-05-07T20:33:08.5161940Z contiguous: bool, 2025-05-07T20:33:08.5162036Z compiled: bool, 2025-05-07T20:33:08.5162118Z ) -> None: 2025-05-07T20:33:08.5162217Z torch.manual_seed(2025) 2025-05-07T20:33:08.5162302Z 2025-05-07T20:33:08.5162474Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5162551Z 2025-05-07T20:33:08.5162653Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5162785Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5162930Z x = x_sign * x_clamp 2025-05-07T20:33:08.5163014Z x0 = x[:, :D] 2025-05-07T20:33:08.5163099Z x1 = x[:, D:] 2025-05-07T20:33:08.5163182Z 2025-05-07T20:33:08.5163270Z if contiguous: 2025-05-07T20:33:08.5163368Z x0 = x0.contiguous() 2025-05-07T20:33:08.5163468Z x1 = x1.contiguous() 2025-05-07T20:33:08.5163550Z 2025-05-07T20:33:08.5163648Z if scale_ub is not None: 2025-05-07T20:33:08.5163765Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5163905Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5163986Z ) 2025-05-07T20:33:08.5164075Z else: 2025-05-07T20:33:08.5164171Z scale_ub_tensor = None 2025-05-07T20:33:08.5164246Z 2025-05-07T20:33:08.5164390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5164485Z op = silu_mul_quant 2025-05-07T20:33:08.5164582Z if compiled: 2025-05-07T20:33:08.5164692Z op = torch.compile(op) 2025-05-07T20:33:08.5164799Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5164882Z 2025-05-07T20:33:08.5164976Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5164981Z 2025-05-07T20:33:08.5165082Z moe/activation_test.py:117: 2025-05-07T20:33:08.5165223Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5165328Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5165436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5165949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5166049Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5166423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5166648Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5167037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5167143Z kernel = self.compile( 2025-05-07T20:33:08.5167527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5167720Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5167849Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5167889Z 2025-05-07T20:33:08.5168103Z self = 2025-05-07T20:33:08.5168935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5169445Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec343a0>} 2025-05-07T20:33:08.5170245Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5170441Z context = 2025-05-07T20:33:08.5170448Z 2025-05-07T20:33:08.5170619Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5170897Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5171009Z module_map=module_map) 2025-05-07T20:33:08.5171181Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5171283Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5171365Z E ^ 2025-05-07T20:33:08.5171774Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5171779Z 2025-05-07T20:33:08.5172197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5172202Z 2025-05-07T20:33:08.5172317Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5172542Z self=, 2025-05-07T20:33:08.5172625Z T=4096, 2025-05-07T20:33:08.5172710Z D=5120, 2025-05-07T20:33:08.5172795Z scale_ub=1200.0, 2025-05-07T20:33:08.5172884Z contiguous=False, 2025-05-07T20:33:08.5172979Z compiled=True, 2025-05-07T20:33:08.5173054Z ) 2025-05-07T20:33:08.5173272Z self = 2025-05-07T20:33:08.5173453Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:08.5173458Z 2025-05-07T20:33:08.5173545Z @given( 2025-05-07T20:33:08.5173674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5173778Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5173900Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5174029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5174145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5174222Z ) 2025-05-07T20:33:08.5174476Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5174575Z def test_silu_mul_quant( 2025-05-07T20:33:08.5174655Z self, 2025-05-07T20:33:08.5174744Z T: int, 2025-05-07T20:33:08.5174828Z D: int, 2025-05-07T20:33:08.5174932Z scale_ub: Optional[float], 2025-05-07T20:33:08.5175032Z contiguous: bool, 2025-05-07T20:33:08.5175121Z compiled: bool, 2025-05-07T20:33:08.5175212Z ) -> None: 2025-05-07T20:33:08.5175355Z torch.manual_seed(2025) 2025-05-07T20:33:08.5175436Z 2025-05-07T20:33:08.5175619Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5175699Z 2025-05-07T20:33:08.5175794Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5175932Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5176028Z x = x_sign * x_clamp 2025-05-07T20:33:08.5176113Z x0 = x[:, :D] 2025-05-07T20:33:08.5176204Z x1 = x[:, D:] 2025-05-07T20:33:08.5176320Z 2025-05-07T20:33:08.5176409Z if contiguous: 2025-05-07T20:33:08.5176512Z x0 = x0.contiguous() 2025-05-07T20:33:08.5176607Z x1 = x1.contiguous() 2025-05-07T20:33:08.5176690Z 2025-05-07T20:33:08.5176783Z if scale_ub is not None: 2025-05-07T20:33:08.5176893Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5177042Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5177123Z ) 2025-05-07T20:33:08.5177211Z else: 2025-05-07T20:33:08.5177320Z scale_ub_tensor = None 2025-05-07T20:33:08.5177396Z 2025-05-07T20:33:08.5177535Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5177639Z op = silu_mul_quant 2025-05-07T20:33:08.5177729Z if compiled: 2025-05-07T20:33:08.5177871Z op = torch.compile(op) 2025-05-07T20:33:08.5177990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5178070Z 2025-05-07T20:33:08.5178172Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5178176Z 2025-05-07T20:33:08.5178283Z moe/activation_test.py:117: 2025-05-07T20:33:08.5178414Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5178517Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5178633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5179003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5179143Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5179653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5179752Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5180118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5180342Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5180681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5180784Z kernel = self.compile( 2025-05-07T20:33:08.5181307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5181494Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5181628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5181635Z 2025-05-07T20:33:08.5181845Z self = 2025-05-07T20:33:08.5182623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5183137Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec34280>} 2025-05-07T20:33:08.5183886Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5184079Z context = 2025-05-07T20:33:08.5184132Z 2025-05-07T20:33:08.5184304Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5184579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5184690Z module_map=module_map) 2025-05-07T20:33:08.5184865Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5184965Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5185085Z E ^ 2025-05-07T20:33:08.5185454Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5185459Z 2025-05-07T20:33:08.5185876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5185881Z 2025-05-07T20:33:08.5185993Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5186221Z self=, 2025-05-07T20:33:08.5186303Z T=2048, 2025-05-07T20:33:08.5186388Z D=7168, 2025-05-07T20:33:08.5186475Z scale_ub=1200.0, 2025-05-07T20:33:08.5186564Z contiguous=False, 2025-05-07T20:33:08.5186661Z compiled=False, 2025-05-07T20:33:08.5186736Z ) 2025-05-07T20:33:08.5187028Z self = 2025-05-07T20:33:08.5187219Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.5187226Z 2025-05-07T20:33:08.5187308Z @given( 2025-05-07T20:33:08.5187434Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5187536Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5187653Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5187778Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5187893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5187970Z ) 2025-05-07T20:33:08.5188266Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5188364Z def test_silu_mul_quant( 2025-05-07T20:33:08.5188449Z self, 2025-05-07T20:33:08.5188535Z T: int, 2025-05-07T20:33:08.5188614Z D: int, 2025-05-07T20:33:08.5188715Z scale_ub: Optional[float], 2025-05-07T20:33:08.5188815Z contiguous: bool, 2025-05-07T20:33:08.5188903Z compiled: bool, 2025-05-07T20:33:08.5188991Z ) -> None: 2025-05-07T20:33:08.5189091Z torch.manual_seed(2025) 2025-05-07T20:33:08.5189165Z 2025-05-07T20:33:08.5189344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5189420Z 2025-05-07T20:33:08.5189514Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5189647Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5189739Z x = x_sign * x_clamp 2025-05-07T20:33:08.5189822Z x0 = x[:, :D] 2025-05-07T20:33:08.5189912Z x1 = x[:, D:] 2025-05-07T20:33:08.5189991Z 2025-05-07T20:33:08.5190079Z if contiguous: 2025-05-07T20:33:08.5190179Z x0 = x0.contiguous() 2025-05-07T20:33:08.5190272Z x1 = x1.contiguous() 2025-05-07T20:33:08.5190356Z 2025-05-07T20:33:08.5190449Z if scale_ub is not None: 2025-05-07T20:33:08.5190560Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5190707Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5190788Z ) 2025-05-07T20:33:08.5190870Z else: 2025-05-07T20:33:08.5190974Z scale_ub_tensor = None 2025-05-07T20:33:08.5191048Z 2025-05-07T20:33:08.5191182Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5191281Z op = silu_mul_quant 2025-05-07T20:33:08.5191369Z if compiled: 2025-05-07T20:33:08.5191470Z op = torch.compile(op) 2025-05-07T20:33:08.5191585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5191710Z 2025-05-07T20:33:08.5191815Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5191820Z 2025-05-07T20:33:08.5191919Z moe/activation_test.py:117: 2025-05-07T20:33:08.5192049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5192161Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5192266Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5192767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5192937Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5193294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5193525Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5193863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5193965Z kernel = self.compile( 2025-05-07T20:33:08.5194348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5194526Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5194691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5194704Z 2025-05-07T20:33:08.5194914Z self = 2025-05-07T20:33:08.5195687Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5196206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ed85670>} 2025-05-07T20:33:08.5197003Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5197205Z context = 2025-05-07T20:33:08.5197211Z 2025-05-07T20:33:08.5197378Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5197646Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5197765Z module_map=module_map) 2025-05-07T20:33:08.5197929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5198029Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5198117Z E ^ 2025-05-07T20:33:08.5198476Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5198486Z 2025-05-07T20:33:08.5198955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5198960Z 2025-05-07T20:33:08.5199064Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5199290Z self=, 2025-05-07T20:33:08.5199378Z T=1, 2025-05-07T20:33:08.5199457Z D=7168, 2025-05-07T20:33:08.5199547Z scale_ub=None, 2025-05-07T20:33:08.5199640Z contiguous=True, 2025-05-07T20:33:08.5199727Z compiled=False, 2025-05-07T20:33:08.5199807Z ) 2025-05-07T20:33:08.5200025Z self = 2025-05-07T20:33:08.5200193Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:08.5200198Z 2025-05-07T20:33:08.5200284Z @given( 2025-05-07T20:33:08.5200406Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5200553Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5200679Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5200801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5200924Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5201002Z ) 2025-05-07T20:33:08.5201251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5201352Z def test_silu_mul_quant( 2025-05-07T20:33:08.5201468Z self, 2025-05-07T20:33:08.5201549Z T: int, 2025-05-07T20:33:08.5201634Z D: int, 2025-05-07T20:33:08.5201737Z scale_ub: Optional[float], 2025-05-07T20:33:08.5201831Z contiguous: bool, 2025-05-07T20:33:08.5201925Z compiled: bool, 2025-05-07T20:33:08.5202006Z ) -> None: 2025-05-07T20:33:08.5202105Z torch.manual_seed(2025) 2025-05-07T20:33:08.5202186Z 2025-05-07T20:33:08.5202362Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5202446Z 2025-05-07T20:33:08.5202546Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5202673Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5202772Z x = x_sign * x_clamp 2025-05-07T20:33:08.5202855Z x0 = x[:, :D] 2025-05-07T20:33:08.5202937Z x1 = x[:, D:] 2025-05-07T20:33:08.5203056Z 2025-05-07T20:33:08.5203144Z if contiguous: 2025-05-07T20:33:08.5203238Z x0 = x0.contiguous() 2025-05-07T20:33:08.5203337Z x1 = x1.contiguous() 2025-05-07T20:33:08.5203412Z 2025-05-07T20:33:08.5203504Z if scale_ub is not None: 2025-05-07T20:33:08.5203618Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5203757Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5203835Z ) 2025-05-07T20:33:08.5203934Z else: 2025-05-07T20:33:08.5204031Z scale_ub_tensor = None 2025-05-07T20:33:08.5204106Z 2025-05-07T20:33:08.5204293Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5204388Z op = silu_mul_quant 2025-05-07T20:33:08.5204480Z if compiled: 2025-05-07T20:33:08.5204589Z op = torch.compile(op) 2025-05-07T20:33:08.5204699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5204775Z 2025-05-07T20:33:08.5204881Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5204885Z 2025-05-07T20:33:08.5204986Z moe/activation_test.py:117: 2025-05-07T20:33:08.5205130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5205237Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5205341Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5205842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5205944Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5206302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5206537Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5206874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5206977Z kernel = self.compile( 2025-05-07T20:33:08.5207354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5207535Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5207672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5207676Z 2025-05-07T20:33:08.5207882Z self = 2025-05-07T20:33:08.5208705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5209221Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec5f280>} 2025-05-07T20:33:08.5209968Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5210208Z context = 2025-05-07T20:33:08.5210213Z 2025-05-07T20:33:08.5210381Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5210652Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5210762Z module_map=module_map) 2025-05-07T20:33:08.5210934Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5211041Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5211120Z E ^ 2025-05-07T20:33:08.5211484Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5211489Z 2025-05-07T20:33:08.5211935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5211943Z 2025-05-07T20:33:08.5212049Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5212277Z self=, 2025-05-07T20:33:08.5212358Z T=16384, 2025-05-07T20:33:08.5212436Z D=7168, 2025-05-07T20:33:08.5212525Z scale_ub=1200.0, 2025-05-07T20:33:08.5212613Z contiguous=False, 2025-05-07T20:33:08.5212704Z compiled=True, 2025-05-07T20:33:08.5212779Z ) 2025-05-07T20:33:08.5213001Z self = 2025-05-07T20:33:08.5213234Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:08.5213238Z 2025-05-07T20:33:08.5213318Z @given( 2025-05-07T20:33:08.5213438Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5213553Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5213670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5213788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5213912Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5213988Z ) 2025-05-07T20:33:08.5214241Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5214338Z def test_silu_mul_quant( 2025-05-07T20:33:08.5214416Z self, 2025-05-07T20:33:08.5214501Z T: int, 2025-05-07T20:33:08.5214580Z D: int, 2025-05-07T20:33:08.5214680Z scale_ub: Optional[float], 2025-05-07T20:33:08.5214784Z contiguous: bool, 2025-05-07T20:33:08.5214874Z compiled: bool, 2025-05-07T20:33:08.5214954Z ) -> None: 2025-05-07T20:33:08.5215058Z torch.manual_seed(2025) 2025-05-07T20:33:08.5215133Z 2025-05-07T20:33:08.5215304Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5215388Z 2025-05-07T20:33:08.5215483Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5215615Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5215710Z x = x_sign * x_clamp 2025-05-07T20:33:08.5215793Z x0 = x[:, :D] 2025-05-07T20:33:08.5215882Z x1 = x[:, D:] 2025-05-07T20:33:08.5215958Z 2025-05-07T20:33:08.5216044Z if contiguous: 2025-05-07T20:33:08.5216143Z x0 = x0.contiguous() 2025-05-07T20:33:08.5216234Z x1 = x1.contiguous() 2025-05-07T20:33:08.5216308Z 2025-05-07T20:33:08.5216410Z if scale_ub is not None: 2025-05-07T20:33:08.5216569Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5216711Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5216796Z ) 2025-05-07T20:33:08.5216876Z else: 2025-05-07T20:33:08.5216978Z scale_ub_tensor = None 2025-05-07T20:33:08.5217052Z 2025-05-07T20:33:08.5217190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5217288Z op = silu_mul_quant 2025-05-07T20:33:08.5217448Z if compiled: 2025-05-07T20:33:08.5217552Z op = torch.compile(op) 2025-05-07T20:33:08.5217668Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5217743Z 2025-05-07T20:33:08.5217836Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5217841Z 2025-05-07T20:33:08.5217948Z moe/activation_test.py:117: 2025-05-07T20:33:08.5218081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5218183Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5218302Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5218675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5218778Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5219307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5219413Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5219781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5220007Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5220352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5220452Z kernel = self.compile( 2025-05-07T20:33:08.5220832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5221180Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5221311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5221316Z 2025-05-07T20:33:08.5221524Z self = 2025-05-07T20:33:08.5222317Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5222824Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ec5fee0>} 2025-05-07T20:33:08.5223572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5223765Z context = 2025-05-07T20:33:08.5223770Z 2025-05-07T20:33:08.5223944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5224211Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5224322Z module_map=module_map) 2025-05-07T20:33:08.5224495Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5224595Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5224675Z E ^ 2025-05-07T20:33:08.5225038Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5225043Z 2025-05-07T20:33:08.5225498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5225506Z 2025-05-07T20:33:08.5225623Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5225847Z self=, 2025-05-07T20:33:08.5225926Z T=1, 2025-05-07T20:33:08.5226011Z D=7168, 2025-05-07T20:33:08.5226098Z scale_ub=None, 2025-05-07T20:33:08.5226189Z contiguous=False, 2025-05-07T20:33:08.5226287Z compiled=False, 2025-05-07T20:33:08.5226363Z ) 2025-05-07T20:33:08.5226628Z self = 2025-05-07T20:33:08.5226799Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.5226804Z 2025-05-07T20:33:08.5226884Z @given( 2025-05-07T20:33:08.5227009Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5227113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5227229Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5227358Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5227479Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5227950Z ) 2025-05-07T20:33:08.5228197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5228294Z def test_silu_mul_quant( 2025-05-07T20:33:08.5228423Z self, 2025-05-07T20:33:08.5228503Z T: int, 2025-05-07T20:33:08.5228582Z D: int, 2025-05-07T20:33:08.5228692Z scale_ub: Optional[float], 2025-05-07T20:33:08.5228785Z contiguous: bool, 2025-05-07T20:33:08.5228873Z compiled: bool, 2025-05-07T20:33:08.5228959Z ) -> None: 2025-05-07T20:33:08.5229055Z torch.manual_seed(2025) 2025-05-07T20:33:08.5229131Z 2025-05-07T20:33:08.5229312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5229388Z 2025-05-07T20:33:08.5229489Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5229618Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5229755Z x = x_sign * x_clamp 2025-05-07T20:33:08.5229846Z x0 = x[:, :D] 2025-05-07T20:33:08.5229928Z x1 = x[:, D:] 2025-05-07T20:33:08.5230002Z 2025-05-07T20:33:08.5230095Z if contiguous: 2025-05-07T20:33:08.5230188Z x0 = x0.contiguous() 2025-05-07T20:33:08.5230282Z x1 = x1.contiguous() 2025-05-07T20:33:08.5230365Z 2025-05-07T20:33:08.5230457Z if scale_ub is not None: 2025-05-07T20:33:08.5230566Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5230710Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5230788Z ) 2025-05-07T20:33:08.5230867Z else: 2025-05-07T20:33:08.5230968Z scale_ub_tensor = None 2025-05-07T20:33:08.5231042Z 2025-05-07T20:33:08.5231180Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5231278Z op = silu_mul_quant 2025-05-07T20:33:08.5231370Z if compiled: 2025-05-07T20:33:08.5231477Z op = torch.compile(op) 2025-05-07T20:33:08.5231587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5231660Z 2025-05-07T20:33:08.5231759Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5231764Z 2025-05-07T20:33:08.5231862Z moe/activation_test.py:117: 2025-05-07T20:33:08.5231993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5232104Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5232207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5232715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5232813Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5233169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5233445Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5233796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5233892Z kernel = self.compile( 2025-05-07T20:33:08.5234279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5234460Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5234631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5234635Z 2025-05-07T20:33:08.5234846Z self = 2025-05-07T20:33:08.5235617Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5236133Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158ecb6670>} 2025-05-07T20:33:08.5236909Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5237116Z context = 2025-05-07T20:33:08.5237122Z 2025-05-07T20:33:08.5237289Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5237560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5237671Z module_map=module_map) 2025-05-07T20:33:08.5237834Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5237942Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5238065Z E ^ 2025-05-07T20:33:08.5238420Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5238424Z 2025-05-07T20:33:08.5238841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5238848Z 2025-05-07T20:33:08.5238952Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5239181Z self=, 2025-05-07T20:33:08.5239262Z T=2048, 2025-05-07T20:33:08.5239341Z D=7168, 2025-05-07T20:33:08.5239431Z scale_ub=None, 2025-05-07T20:33:08.5239520Z contiguous=False, 2025-05-07T20:33:08.5239608Z compiled=True, 2025-05-07T20:33:08.5239690Z ) 2025-05-07T20:33:08.5239910Z self = 2025-05-07T20:33:08.5240363Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.5240395Z 2025-05-07T20:33:08.5240513Z @given( 2025-05-07T20:33:08.5240677Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5240826Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5240959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5241083Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5241205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5241285Z ) 2025-05-07T20:33:08.5241533Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5241637Z def test_silu_mul_quant( 2025-05-07T20:33:08.5241717Z self, 2025-05-07T20:33:08.5241795Z T: int, 2025-05-07T20:33:08.5241879Z D: int, 2025-05-07T20:33:08.5241977Z scale_ub: Optional[float], 2025-05-07T20:33:08.5242075Z contiguous: bool, 2025-05-07T20:33:08.5242161Z compiled: bool, 2025-05-07T20:33:08.5242242Z ) -> None: 2025-05-07T20:33:08.5242513Z torch.manual_seed(2025) 2025-05-07T20:33:08.5242592Z 2025-05-07T20:33:08.5242766Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5242847Z 2025-05-07T20:33:08.5242940Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5243068Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5243163Z x = x_sign * x_clamp 2025-05-07T20:33:08.5243245Z x0 = x[:, :D] 2025-05-07T20:33:08.5243392Z x1 = x[:, D:] 2025-05-07T20:33:08.5243472Z 2025-05-07T20:33:08.5243557Z if contiguous: 2025-05-07T20:33:08.5243655Z x0 = x0.contiguous() 2025-05-07T20:33:08.5243745Z x1 = x1.contiguous() 2025-05-07T20:33:08.5243821Z 2025-05-07T20:33:08.5243921Z if scale_ub is not None: 2025-05-07T20:33:08.5244027Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5244165Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5244258Z ) 2025-05-07T20:33:08.5244338Z else: 2025-05-07T20:33:08.5244434Z scale_ub_tensor = None 2025-05-07T20:33:08.5244515Z 2025-05-07T20:33:08.5244647Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5244740Z op = silu_mul_quant 2025-05-07T20:33:08.5244897Z if compiled: 2025-05-07T20:33:08.5245001Z op = torch.compile(op) 2025-05-07T20:33:08.5245117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5245198Z 2025-05-07T20:33:08.5245292Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5245296Z 2025-05-07T20:33:08.5245403Z moe/activation_test.py:117: 2025-05-07T20:33:08.5245532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5245634Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5245742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5246118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5246280Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5246778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5246878Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5247246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5247474Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5247814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5247915Z kernel = self.compile( 2025-05-07T20:33:08.5248291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5248474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5248607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5248611Z 2025-05-07T20:33:08.5248818Z self = 2025-05-07T20:33:08.5249596Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5250106Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e8c1550>} 2025-05-07T20:33:08.5250851Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5251117Z context = 2025-05-07T20:33:08.5251125Z 2025-05-07T20:33:08.5251298Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5251575Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5251687Z module_map=module_map) 2025-05-07T20:33:08.5251857Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5251957Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5252080Z E ^ 2025-05-07T20:33:08.5252443Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5252447Z 2025-05-07T20:33:08.5252856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5252860Z 2025-05-07T20:33:08.5252972Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5253203Z self=, 2025-05-07T20:33:08.5253282Z T=4096, 2025-05-07T20:33:08.5253368Z D=7168, 2025-05-07T20:33:08.5253454Z scale_ub=None, 2025-05-07T20:33:08.5253543Z contiguous=False, 2025-05-07T20:33:08.5253634Z compiled=True, 2025-05-07T20:33:08.5253710Z ) 2025-05-07T20:33:08.5253967Z self = 2025-05-07T20:33:08.5254155Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.5254162Z 2025-05-07T20:33:08.5254242Z @given( 2025-05-07T20:33:08.5254370Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5254474Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5254591Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5254714Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5254830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5254951Z ) 2025-05-07T20:33:08.5255205Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5255301Z def test_silu_mul_quant( 2025-05-07T20:33:08.5255378Z self, 2025-05-07T20:33:08.5255465Z T: int, 2025-05-07T20:33:08.5255544Z D: int, 2025-05-07T20:33:08.5255649Z scale_ub: Optional[float], 2025-05-07T20:33:08.5255749Z contiguous: bool, 2025-05-07T20:33:08.5255838Z compiled: bool, 2025-05-07T20:33:08.5255928Z ) -> None: 2025-05-07T20:33:08.5256025Z torch.manual_seed(2025) 2025-05-07T20:33:08.5256100Z 2025-05-07T20:33:08.5256278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5256357Z 2025-05-07T20:33:08.5256452Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5256587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5256679Z x = x_sign * x_clamp 2025-05-07T20:33:08.5256764Z x0 = x[:, :D] 2025-05-07T20:33:08.5256858Z x1 = x[:, D:] 2025-05-07T20:33:08.5256933Z 2025-05-07T20:33:08.5257018Z if contiguous: 2025-05-07T20:33:08.5257119Z x0 = x0.contiguous() 2025-05-07T20:33:08.5257210Z x1 = x1.contiguous() 2025-05-07T20:33:08.5257283Z 2025-05-07T20:33:08.5257382Z if scale_ub is not None: 2025-05-07T20:33:08.5257492Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5257638Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5257719Z ) 2025-05-07T20:33:08.5257800Z else: 2025-05-07T20:33:08.5257903Z scale_ub_tensor = None 2025-05-07T20:33:08.5257977Z 2025-05-07T20:33:08.5258109Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5258207Z op = silu_mul_quant 2025-05-07T20:33:08.5258294Z if compiled: 2025-05-07T20:33:08.5258396Z op = torch.compile(op) 2025-05-07T20:33:08.5258566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5258660Z 2025-05-07T20:33:08.5258773Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5258785Z 2025-05-07T20:33:08.5258891Z moe/activation_test.py:117: 2025-05-07T20:33:08.5259021Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5259131Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5259235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5259601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5259744Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5260236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5260342Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5260701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5260934Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5261386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5261482Z kernel = self.compile( 2025-05-07T20:33:08.5261901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5262092Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5262222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5262227Z 2025-05-07T20:33:08.5262439Z self = 2025-05-07T20:33:08.5263225Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5263775Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e92b160>} 2025-05-07T20:33:08.5264538Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5264733Z context = 2025-05-07T20:33:08.5264737Z 2025-05-07T20:33:08.5264913Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5265182Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5265292Z module_map=module_map) 2025-05-07T20:33:08.5265459Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5265564Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5265649Z E ^ 2025-05-07T20:33:08.5266001Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5266006Z 2025-05-07T20:33:08.5266419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5266424Z 2025-05-07T20:33:08.5266533Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5266758Z self=, 2025-05-07T20:33:08.5266843Z T=16384, 2025-05-07T20:33:08.5266922Z D=5120, 2025-05-07T20:33:08.5272173Z scale_ub=1200.0, 2025-05-07T20:33:08.5272295Z contiguous=False, 2025-05-07T20:33:08.5272386Z compiled=False, 2025-05-07T20:33:08.5272475Z ) 2025-05-07T20:33:08.5272706Z self = 2025-05-07T20:33:08.5272994Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.5273003Z 2025-05-07T20:33:08.5273087Z @given( 2025-05-07T20:33:08.5273213Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5273327Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5273451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5273577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5273703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5273826Z ) 2025-05-07T20:33:08.5274078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5274185Z def test_silu_mul_quant( 2025-05-07T20:33:08.5274266Z self, 2025-05-07T20:33:08.5274356Z T: int, 2025-05-07T20:33:08.5274436Z D: int, 2025-05-07T20:33:08.5274539Z scale_ub: Optional[float], 2025-05-07T20:33:08.5274641Z contiguous: bool, 2025-05-07T20:33:08.5274740Z compiled: bool, 2025-05-07T20:33:08.5274822Z ) -> None: 2025-05-07T20:33:08.5274931Z torch.manual_seed(2025) 2025-05-07T20:33:08.5275010Z 2025-05-07T20:33:08.5275183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5275269Z 2025-05-07T20:33:08.5275366Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5275536Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5275640Z x = x_sign * x_clamp 2025-05-07T20:33:08.5275729Z x0 = x[:, :D] 2025-05-07T20:33:08.5275813Z x1 = x[:, D:] 2025-05-07T20:33:08.5275897Z 2025-05-07T20:33:08.5275985Z if contiguous: 2025-05-07T20:33:08.5276087Z x0 = x0.contiguous() 2025-05-07T20:33:08.5276180Z x1 = x1.contiguous() 2025-05-07T20:33:08.5276257Z 2025-05-07T20:33:08.5276362Z if scale_ub is not None: 2025-05-07T20:33:08.5276471Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5276619Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5276748Z ) 2025-05-07T20:33:08.5276829Z else: 2025-05-07T20:33:08.5276927Z scale_ub_tensor = None 2025-05-07T20:33:08.5277013Z 2025-05-07T20:33:08.5277150Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5277246Z op = silu_mul_quant 2025-05-07T20:33:08.5277342Z if compiled: 2025-05-07T20:33:08.5277447Z op = torch.compile(op) 2025-05-07T20:33:08.5277566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5277644Z 2025-05-07T20:33:08.5277739Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5277744Z 2025-05-07T20:33:08.5277853Z moe/activation_test.py:117: 2025-05-07T20:33:08.5277984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5278087Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5278197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5278713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:08.5278819Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:08.5279178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:08.5279410Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:08.5279765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:08.5279866Z     kernel = self.compile(
2025-05-07T20:33:08.5280244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:08.5280429Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:08.5280558Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:08.5280823Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:33:08.5281614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:08.5282133Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f158e92b940>}
2025-05-07T20:33:08.5282924Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:08.5283120Z context = <...>
2025-05-07T20:33:08.5283306Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:08.5283574Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:08.5283687Z                            module_map=module_map)
2025-05-07T20:33:08.5283860Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.5283998Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:08.5284087Z E   ^
2025-05-07T20:33:08.5284443Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.5284869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
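[Editor's note] Every example above fails at the same point: Triton refuses to lower the fp8e4nv element type (the float8 e4m3 variant that silu_mul_quant quantizes into) while building _fbgemm_silu_mul_quant. Triton's CUDA backend compiles fp8e4nv only for devices of compute capability 8.9 and newer (Ada and Hopper); on older parts such as the A100 (SM 8.0) or A10G (SM 8.6) only fp8e4b15 and fp8e5 are available, which is exactly what the ValueError reports. A minimal sketch of a capability guard that would skip these cases on unsupported hardware follows; the helper and the test-class name are illustrative, not code from activation_test.py:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv corresponds to torch.float8_e4m3fn; Triton's CUDA backend lowers
    # it only on compute capability (8, 9) or newer. On anything older the
    # kernel build raises the ValueError captured in this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires an SM 8.9+ GPU")
class Fp8ActivationTests(unittest.TestCase):  # illustrative class name
    ...

With a guard like this (or an equivalent pytest.mark.skipif), the job would record skips on pre-Ada runners instead of re-failing every generated example.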
[The remaining Hypothesis retries are elided: each one re-printed the identical test source and failed with the identical CompilationError at triton/compiler/compiler.py:100. The examples tried were:
    T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
    T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True
    T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True
    T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
    T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True
    T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False
    T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False
    T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True]
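[Editor's note] For debugging outside the Hypothesis harness, a standalone call reproduces the failure. This is a sketch that assumes only that silu_mul_quant is importable from the path shown in the tracebacks; T=128 and D=5120 are arbitrary values from the test's sample space:

import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# One bfloat16 activation tensor, split into the two halves x0 and x1 as in the test.
T, D = 128, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

# On SM < 8.9 GPUs this raises the triton CompilationError seen above; on
# SM 8.9+ hardware it should instead return the fp8 output and its scale.
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)

The final recorded example and its identical traceback close this excerpt.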
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5427313Z 2025-05-07T20:33:08.5427740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5427745Z 2025-05-07T20:33:08.5427849Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5428072Z self=, 2025-05-07T20:33:08.5428162Z T=128, 2025-05-07T20:33:08.5428241Z D=7168, 2025-05-07T20:33:08.5428336Z scale_ub=1200.0, 2025-05-07T20:33:08.5428424Z contiguous=False, 2025-05-07T20:33:08.5428518Z compiled=True, 2025-05-07T20:33:08.5428606Z ) 2025-05-07T20:33:08.5428863Z self = 2025-05-07T20:33:08.5429043Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:08.5429047Z 2025-05-07T20:33:08.5429136Z @given( 2025-05-07T20:33:08.5429257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5429361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5429491Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5429612Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5429735Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5429813Z ) 2025-05-07T20:33:08.5430057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5430160Z def test_silu_mul_quant( 2025-05-07T20:33:08.5430238Z self, 2025-05-07T20:33:08.5430362Z T: int, 2025-05-07T20:33:08.5430453Z D: int, 2025-05-07T20:33:08.5430552Z scale_ub: Optional[float], 2025-05-07T20:33:08.5430642Z contiguous: bool, 2025-05-07T20:33:08.5430738Z compiled: bool, 2025-05-07T20:33:08.5430821Z ) -> None: 2025-05-07T20:33:08.5430919Z torch.manual_seed(2025) 2025-05-07T20:33:08.5431006Z 2025-05-07T20:33:08.5431176Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5431299Z 2025-05-07T20:33:08.5431395Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5431523Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5431622Z x = x_sign * x_clamp 2025-05-07T20:33:08.5431706Z x0 = x[:, :D] 2025-05-07T20:33:08.5431787Z x1 = x[:, D:] 2025-05-07T20:33:08.5431869Z 2025-05-07T20:33:08.5431954Z if contiguous: 2025-05-07T20:33:08.5432047Z x0 = x0.contiguous() 2025-05-07T20:33:08.5432146Z x1 = x1.contiguous() 2025-05-07T20:33:08.5432228Z 2025-05-07T20:33:08.5432319Z if scale_ub is not None: 2025-05-07T20:33:08.5432435Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5432573Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5432658Z ) 2025-05-07T20:33:08.5432736Z else: 2025-05-07T20:33:08.5432873Z scale_ub_tensor = None 2025-05-07T20:33:08.5432959Z 2025-05-07T20:33:08.5433094Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5433189Z op = silu_mul_quant 2025-05-07T20:33:08.5433283Z if compiled: 2025-05-07T20:33:08.5433387Z op = torch.compile(op) 2025-05-07T20:33:08.5433498Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5433579Z 2025-05-07T20:33:08.5433672Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5433676Z 2025-05-07T20:33:08.5433775Z moe/activation_test.py:117: 2025-05-07T20:33:08.5433918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5434067Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5434176Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5434549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5434651Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5435162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5435266Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5435625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5435862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5436199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5436306Z kernel = self.compile( 2025-05-07T20:33:08.5436693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5436870Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5437011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5437016Z 2025-05-07T20:33:08.5437220Z self = 2025-05-07T20:33:08.5438016Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5438518Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e3bf940>} 2025-05-07T20:33:08.5439313Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5439515Z context = 2025-05-07T20:33:08.5439519Z 2025-05-07T20:33:08.5439703Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5439966Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5440675Z module_map=module_map) 2025-05-07T20:33:08.5440878Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5440978Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5441065Z E ^ 2025-05-07T20:33:08.5441420Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5441431Z 2025-05-07T20:33:08.5441860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5441865Z 2025-05-07T20:33:08.5441972Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5442375Z self=, 2025-05-07T20:33:08.5442466Z T=2048, 2025-05-07T20:33:08.5442544Z D=7168, 2025-05-07T20:33:08.5442634Z scale_ub=None, 2025-05-07T20:33:08.5442723Z contiguous=True, 2025-05-07T20:33:08.5442810Z compiled=True, 2025-05-07T20:33:08.5442890Z ) 2025-05-07T20:33:08.5443108Z self = 2025-05-07T20:33:08.5443286Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:08.5443290Z 2025-05-07T20:33:08.5443374Z @given( 2025-05-07T20:33:08.5443495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5443601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5443797Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5443914Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5444036Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5444112Z ) 2025-05-07T20:33:08.5444361Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5444462Z def test_silu_mul_quant( 2025-05-07T20:33:08.5444541Z self, 2025-05-07T20:33:08.5444621Z T: int, 2025-05-07T20:33:08.5444706Z D: int, 2025-05-07T20:33:08.5444806Z scale_ub: Optional[float], 2025-05-07T20:33:08.5444895Z contiguous: bool, 2025-05-07T20:33:08.5444994Z compiled: bool, 2025-05-07T20:33:08.5445073Z ) -> None: 2025-05-07T20:33:08.5445169Z torch.manual_seed(2025) 2025-05-07T20:33:08.5445251Z 2025-05-07T20:33:08.5445421Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5445508Z 2025-05-07T20:33:08.5445600Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5445726Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5445825Z x = x_sign * x_clamp 2025-05-07T20:33:08.5445906Z x0 = x[:, :D] 2025-05-07T20:33:08.5445987Z x1 = x[:, D:] 2025-05-07T20:33:08.5446069Z 2025-05-07T20:33:08.5446155Z if contiguous: 2025-05-07T20:33:08.5446246Z x0 = x0.contiguous() 2025-05-07T20:33:08.5446345Z x1 = x1.contiguous() 2025-05-07T20:33:08.5446418Z 2025-05-07T20:33:08.5446509Z if scale_ub is not None: 2025-05-07T20:33:08.5446621Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5446759Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5446836Z ) 2025-05-07T20:33:08.5446921Z else: 2025-05-07T20:33:08.5447015Z scale_ub_tensor = None 2025-05-07T20:33:08.5447096Z 2025-05-07T20:33:08.5447307Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5447401Z op = silu_mul_quant 2025-05-07T20:33:08.5447493Z if compiled: 2025-05-07T20:33:08.5447595Z op = torch.compile(op) 2025-05-07T20:33:08.5447702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5447781Z 2025-05-07T20:33:08.5447875Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5447880Z 2025-05-07T20:33:08.5447977Z moe/activation_test.py:117: 2025-05-07T20:33:08.5448175Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5448276Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5448384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5448749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5448843Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5449344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5449443Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5449800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5450084Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5450429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5450533Z kernel = self.compile( 2025-05-07T20:33:08.5450910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5451085Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5451220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5451224Z 2025-05-07T20:33:08.5451433Z self = 2025-05-07T20:33:08.5452268Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5452767Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f158e383550>} 2025-05-07T20:33:08.5453522Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5453720Z context = 2025-05-07T20:33:08.5453724Z 2025-05-07T20:33:08.5453890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5454162Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5454272Z module_map=module_map) 2025-05-07T20:33:08.5454434Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5454539Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5454620Z E ^ 2025-05-07T20:33:08.5454971Z E ValueError("type fp8e4nv not supported in this architecture. 
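Every example that reaches the Triton kernel fails with the same CompilationError: fp8e4nv (the e4m3 FP8 format) is not implemented for this GPU, which offers only fp8e4b15 and fp8e5. That is consistent with an A10G-class device reporting CUDA compute capability (8, 6), while Triton's fp8e4nv conversions need capability (8, 9) or newer (Ada/Hopper). A minimal capability gate for a unittest-style suite like the one above could look as follows; the helper and class names are illustrative, not FBGEMM's actual gating:

    import unittest

    import torch

    def fp8e4nv_supported() -> bool:
        # Triton only lowers fp8e4nv on SM 8.9+; an SM 8.6 part is exactly
        # what the ValueError above complains about.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(
        not fp8e4nv_supported(),
        "fp8e4nv unsupported on this architecture (only fp8e4b15/fp8e5)",
    )
    class SiluMulQuantTest(unittest.TestCase):
        ...  # test_silu_mul_quant and friends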
2025-05-07T20:33:08.5455505Z Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test source identical to the example above]
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:33:08.5461359Z Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[test source identical to the example above]
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:33:08.5466952Z Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:33:08.5472237Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[test source identical to the example above]
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:33:08.5477895Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source identical to the example above]
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:94: OutOfMemoryError
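The OutOfMemoryError reports above all carry the allocator's own hint: a slice of memory (19-141 MiB across these examples) is reserved by PyTorch but unallocated, i.e. fragmented. The hinted remedy, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only takes effect if it is in place before the process first touches CUDA, so on a CI runner it belongs in the job environment rather than the test body. A sketch of the in-process equivalent, assuming it runs ahead of any CUDA allocation:

    import os

    # The allocator config is read when the CUDA caching allocator starts
    # up; setting it later in the process has no effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # the first CUDA allocation now sees the config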
2025-05-07T20:33:08.5483407Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source, Triton traceback, and fp8e4nv CompilationError identical to the T=4096 example above]

2025-05-07T20:33:08.5496245Z Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source, Triton traceback, and fp8e4nv CompilationError identical to the T=4096 example above]

2025-05-07T20:33:08.5509324Z Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source, Triton traceback, and fp8e4nv CompilationError identical to the T=4096 example above]
2025-05-07T20:33:08.5522126Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:33:08.5532432Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source, Triton traceback, and fp8e4nv CompilationError identical to the T=4096 example above]

2025-05-07T20:33:08.5546498Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source identical to the example above]
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:94: OutOfMemoryError

2025-05-07T20:33:08.5552060Z Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:33:08.5557522Z Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:33:08.5562806Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:33:08.5568123Z Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. [allocator hint as above]

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:33:08.5573492Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
[test source identical to the example above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.5578526Z 2025-05-07T20:33:08.5578657Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.5578663Z 2025-05-07T20:33:08.5578786Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5579037Z self=, 2025-05-07T20:33:08.5579120Z T=4096, 2025-05-07T20:33:08.5579196Z D=7168, 2025-05-07T20:33:08.5579280Z scale_ub=1200.0, 2025-05-07T20:33:08.5579418Z contiguous=True, 2025-05-07T20:33:08.5579507Z compiled=False, 2025-05-07T20:33:08.5579588Z ) 2025-05-07T20:33:08.5579804Z self = 2025-05-07T20:33:08.5579974Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:08.5579978Z 2025-05-07T20:33:08.5580064Z @given( 2025-05-07T20:33:08.5580180Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5580284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5580405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5580522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5580640Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5580716Z ) 2025-05-07T20:33:08.5581066Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5581173Z def test_silu_mul_quant( 2025-05-07T20:33:08.5581255Z self, 2025-05-07T20:33:08.5581333Z T: int, 2025-05-07T20:33:08.5581417Z D: int, 2025-05-07T20:33:08.5581513Z scale_ub: Optional[float], 2025-05-07T20:33:08.5581601Z contiguous: bool, 2025-05-07T20:33:08.5581692Z compiled: bool, 2025-05-07T20:33:08.5581774Z ) -> None: 2025-05-07T20:33:08.5581870Z torch.manual_seed(2025) 2025-05-07T20:33:08.5581951Z 2025-05-07T20:33:08.5582118Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5583954Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.5583963Z 2025-05-07T20:33:08.5584080Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.5584085Z 2025-05-07T20:33:08.5584192Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5584414Z self=, 2025-05-07T20:33:08.5584492Z T=16384, 2025-05-07T20:33:08.5584575Z D=7168, 2025-05-07T20:33:08.5584659Z scale_ub=None, 2025-05-07T20:33:08.5584745Z contiguous=False, 2025-05-07T20:33:08.5584833Z compiled=True, 2025-05-07T20:33:08.5584905Z ) 2025-05-07T20:33:08.5585119Z self = 2025-05-07T20:33:08.5585304Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.5585309Z 2025-05-07T20:33:08.5585386Z @given( 2025-05-07T20:33:08.5585509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5585610Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5585724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5585847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5585960Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5586037Z ) 2025-05-07T20:33:08.5586288Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5586424Z def test_silu_mul_quant( 2025-05-07T20:33:08.5586507Z self, 2025-05-07T20:33:08.5586592Z T: int, 2025-05-07T20:33:08.5586670Z D: int, 2025-05-07T20:33:08.5586768Z scale_ub: Optional[float], 2025-05-07T20:33:08.5586866Z contiguous: bool, 2025-05-07T20:33:08.5586953Z compiled: bool, 2025-05-07T20:33:08.5587043Z ) -> None: 2025-05-07T20:33:08.5587140Z torch.manual_seed(2025) 2025-05-07T20:33:08.5587214Z 2025-05-07T20:33:08.5587388Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5589207Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.5589215Z 2025-05-07T20:33:08.5589341Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.5589346Z 2025-05-07T20:33:08.5589451Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5589714Z self=, 2025-05-07T20:33:08.5589802Z T=4096, 2025-05-07T20:33:08.5589884Z D=7168, 2025-05-07T20:33:08.5589967Z scale_ub=None, 2025-05-07T20:33:08.5590060Z contiguous=True, 2025-05-07T20:33:08.5590143Z compiled=False, 2025-05-07T20:33:08.5590224Z ) 2025-05-07T20:33:08.5590441Z self = 2025-05-07T20:33:08.5590615Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:08.5590620Z 2025-05-07T20:33:08.5590704Z @given( 2025-05-07T20:33:08.5590866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5590965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5591087Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5591207Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5591327Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5591408Z ) 2025-05-07T20:33:08.5591658Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5591763Z def test_silu_mul_quant( 2025-05-07T20:33:08.5591842Z self, 2025-05-07T20:33:08.5591920Z T: int, 2025-05-07T20:33:08.5592007Z D: int, 2025-05-07T20:33:08.5592104Z scale_ub: Optional[float], 2025-05-07T20:33:08.5592192Z contiguous: bool, 2025-05-07T20:33:08.5592286Z compiled: bool, 2025-05-07T20:33:08.5592364Z ) -> None: 2025-05-07T20:33:08.5592459Z torch.manual_seed(2025) 2025-05-07T20:33:08.5592540Z 2025-05-07T20:33:08.5592713Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5594501Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.5594510Z 2025-05-07T20:33:08.5594630Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.5594634Z 2025-05-07T20:33:08.5594740Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5594961Z self=, 2025-05-07T20:33:08.5595091Z T=16384, 2025-05-07T20:33:08.5595176Z D=7168, 2025-05-07T20:33:08.5595258Z scale_ub=None, 2025-05-07T20:33:08.5595343Z contiguous=True, 2025-05-07T20:33:08.5595431Z compiled=False, 2025-05-07T20:33:08.5595505Z ) 2025-05-07T20:33:08.5595727Z self = 2025-05-07T20:33:08.5595911Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:08.5595916Z 2025-05-07T20:33:08.5596034Z @given( 2025-05-07T20:33:08.5596151Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5596258Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5596372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5596488Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5596610Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5596685Z ) 2025-05-07T20:33:08.5596941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5597039Z def test_silu_mul_quant( 2025-05-07T20:33:08.5597117Z self, 2025-05-07T20:33:08.5597199Z T: int, 2025-05-07T20:33:08.5597278Z D: int, 2025-05-07T20:33:08.5597377Z scale_ub: Optional[float], 2025-05-07T20:33:08.5597475Z contiguous: bool, 2025-05-07T20:33:08.5597603Z compiled: bool, 2025-05-07T20:33:08.5597687Z ) -> None: 2025-05-07T20:33:08.5597789Z torch.manual_seed(2025) 2025-05-07T20:33:08.5597868Z 2025-05-07T20:33:08.5598037Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5599820Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:08.5599866Z 2025-05-07T20:33:08.5599987Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:08.5599998Z 2025-05-07T20:33:08.5600103Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5600323Z self=, 2025-05-07T20:33:08.5600410Z T=16384, 2025-05-07T20:33:08.5600488Z D=7168, 2025-05-07T20:33:08.5600571Z scale_ub=1200.0, 2025-05-07T20:33:08.5600664Z contiguous=True, 2025-05-07T20:33:08.5600749Z compiled=False, 2025-05-07T20:33:08.5600822Z ) 2025-05-07T20:33:08.5601044Z self = 2025-05-07T20:33:08.5601221Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:08.5601228Z 2025-05-07T20:33:08.5601316Z @given( 2025-05-07T20:33:08.5601433Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5601531Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5601651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5601770Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5601883Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5601963Z ) 2025-05-07T20:33:08.5602210Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5602304Z def test_silu_mul_quant( 2025-05-07T20:33:08.5602386Z self, 2025-05-07T20:33:08.5602463Z T: int, 2025-05-07T20:33:08.5602539Z D: int, 2025-05-07T20:33:08.5602642Z scale_ub: Optional[float], 2025-05-07T20:33:08.5602730Z contiguous: bool, 2025-05-07T20:33:08.5602822Z compiled: bool, 2025-05-07T20:33:08.5602899Z ) -> None: 2025-05-07T20:33:08.5603061Z torch.manual_seed(2025) 2025-05-07T20:33:08.5603142Z 2025-05-07T20:33:08.5603310Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5605051Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
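The attempted allocation sizes above are exactly the bfloat16 input tensor of shape [T, 2 * D]. A minimal sketch that checks the arithmetic for the largest example and applies the allocator hint from the error message; it assumes PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is first used, so it must be set before torch touches the GPU (e.g. at the top of the test process):

    import os

    # Allocator hint suggested by the OOM message; must be set before the
    # first CUDA allocation in the process for it to take effect.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    # Sanity-check the 448.00 MiB figure reported for the largest example:
    T, D = 16384, 7168
    bytes_needed = T * (2 * D) * 2          # bfloat16 = 2 bytes per element
    print(bytes_needed / (1024 ** 2))       # -> 448.0, matching the log

The 80/40/112/56 MiB figures for the other (T, D) pairs fall out of the same formula.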
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f158e041ca0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
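The CompilationError above is Triton rejecting the fp8e4nv element type on this runner's GPU: the g5.4xlarge instance carries an NVIDIA A10G (compute capability 8.6), while fp8e4nv (the e4m3 variant behind torch.float8_e4m3fn) generally needs compute capability 8.9 or newer, hence the fallback list ('fp8e4b15', 'fp8e5'). A hedged sketch of a capability guard; the 8.9 threshold, helper name, and class name are assumptions for illustration, not FBGEMM's actual gating:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv kernels require compute capability >= 8.9
        # (Ada/Hopper). The A10G on this runner reports (8, 6), which is
        # exactly the case that triggers the CompilationError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class Fp8GuardedTests(unittest.TestCase):
        ...

With a guard like this the fp8 examples would be skipped on pre-8.9 GPUs instead of failing the job.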
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

This example allocated successfully and reached `y_fp8, y_scale = fn()` (moe/activation_test.py:117), then failed with the same Triton CompilationError as above. Because compiled=True routes the call through torch.compile, the traceback additionally passes through /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678 (in _fn: return fn(*args, **kwargs)) before reaching silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) and Triton's compiler:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

This example got past the initial allocation but hit CUDA OOM on the next temporary:

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated.
See " 2025-05-07T20:33:08.5656427Z 2025-05-07T20:33:08.5656643Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:08.5656810Z ================= 1 failed, 1 deselected, 3 warnings in 19.48s ================= 2025-05-07T20:33:10.1276882Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:10.1905725Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:10.1905972Z 2025-05-07T20:33:10.1906580Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:10.1907164Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:10.1907562Z 2025-05-07T20:33:10.1907569Z 2025-05-07T20:33:10.1907742Z 2025-05-07T20:33:10.1925326Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:10.2007169Z Post job cleanup. 2025-05-07T20:33:10.2986778Z [command]/usr/bin/git version 2025-05-07T20:33:10.3028577Z git version 2.47.1 2025-05-07T20:33:10.3064054Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/2163174b-65c6-42d6-aaf5-a8a6664bfa26/.gitconfig' 2025-05-07T20:33:10.3074567Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/2163174b-65c6-42d6-aaf5-a8a6664bfa26' before making global git config changes 2025-05-07T20:33:10.3075445Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:10.3079829Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:10.3119560Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:10.3154796Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:10.3489775Z Entering 'external/asmjit' 2025-05-07T20:33:10.3556430Z Entering 'external/composable_kernel' 2025-05-07T20:33:10.3629043Z Entering 'external/cpuinfo' 2025-05-07T20:33:10.3697219Z Entering 'external/cutlass' 2025-05-07T20:33:10.3773338Z Entering 'external/googletest' 2025-05-07T20:33:10.3838447Z Entering 'external/hipify_torch' 2025-05-07T20:33:10.3903700Z Entering 'external/json' 2025-05-07T20:33:10.3990579Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:10.4015550Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4027709Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:10.4059068Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:10.4386136Z Entering 'external/asmjit' 2025-05-07T20:33:10.4431292Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4473522Z Entering 'external/composable_kernel' 2025-05-07T20:33:10.4516123Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4565129Z Entering 'external/cpuinfo' 2025-05-07T20:33:10.4607921Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4650203Z Entering 'external/cutlass' 2025-05-07T20:33:10.4693063Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4744840Z 
Entering 'external/googletest' 2025-05-07T20:33:10.4788039Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4830350Z Entering 'external/hipify_torch' 2025-05-07T20:33:10.4873415Z http.https://github.com/.extraheader 2025-05-07T20:33:10.4915296Z Entering 'external/json' 2025-05-07T20:33:10.4961234Z http.https://github.com/.extraheader 2025-05-07T20:33:10.5118527Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:10.5153691Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:10.5163902Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:10.5164261Z ##[endgroup] 2025-05-07T20:33:10.5267936Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:21.2950980Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:37.8098561Z Cleaning up orphan processes
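One pattern worth noting across the failed run above: free GPU memory shrank from 26.44 MiB to 4.44 MiB and PyTorch-allocated memory grew from 21.73 GiB to 21.77 GiB as Hypothesis moved through examples, so state left over from earlier examples (or from earlier tests in the suite) was still resident when later examples started. A sketch, not FBGEMM's code, of releasing cached CUDA memory between examples; the helper name is illustrative:

    import gc
    import torch

    def reset_cuda_memory() -> None:
        gc.collect()                  # drop Python references to dead tensors
        torch.cuda.empty_cache()      # return cached allocator blocks to the driver
        torch.cuda.synchronize()      # ensure pending frees have taken effect

    # e.g. call reset_cuda_memory() at the top of test_silu_mul_quant, or from a
    # unittest setUp(), before allocating the [T, 2 * D] bfloat16 input.

This does not fix fragmentation by itself, but it keeps one OOM-ing example from starving the next one.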